# Mathematical Methods of Statistics


MATHEMATICAL METHODS OF STATISTICS

By HARALD CRAMÉR
Professor in the University of Stockholm

PRINCETON
PRINCETON UNIVERSITY PRESS

First Published 1946. Printed in the United States in 1946. Second Printing: 1947. Third Printing: 1947. Fourth Printing: 1949. Fifth Printing: 1961. Sixth Printing: 1964. Seventh Printing: 1967.

To MARTA

PREFACE.

During the last 25 years, statistical science has made great progress, thanks to the brilliant schools of British and American statisticians, among whom the name of Professor R. A. Fisher should be mentioned in the foremost place. During the same time, largely owing to the work of French and Russian mathematicians, the classical calculus of probability has developed into a purely mathematical theory satisfying modern standards with respect to rigour.

The purpose of the present work is to join these two lines of development in an exposition of the mathematical theory of modern statistical methods, in so far as these are based on the concept of probability. A full understanding of the theory of these methods requires a fairly advanced knowledge of pure mathematics. In this respect, I have tried to make the book self-contained from the point of view of a reader possessing a good working knowledge of the elements of the differential and integral calculus, algebra, and analytic geometry.

In the first part of the book, which serves as a mathematical introduction, the requisite mathematics not assumed to be previously known to the reader are developed. Particular stress has been laid on the fundamental concepts of a distribution, and of the integration with respect to a distribution. As a preliminary to the introduction of these concepts, the theory of Lebesgue measure and integration has been briefly developed in Chapters 4–5, and the fundamental concepts are then introduced by straightforward generalization in Chapters 6–7.
The second part of the book contains the general theory of random variables and probability distributions, while the third part is devoted to the theory of sampling distributions, statistical estimation, and tests of significance. The selection of the questions treated in the last part is necessarily somewhat arbitrary, but I have tried to concentrate in the first place on points of general importance. When these are fully mastered, the reader will be able to work out applications to particular problems for himself. In order to keep the volume of the book within reasonable limits, it has been necessary to exclude certain topics of great interest, which I had originally intended to treat, such as the theory of random processes, statistical time series and periodograms.

The theory of the statistical tests is illustrated by numerical examples borrowed from various fields of application. Owing to considerations of space, it has been necessary to reduce the number of these examples rather severely. It has also been necessary to refrain from every discussion of questions concerning the practical arrangement of numerical calculations.

It is not necessary to go through the first part completely before studying the rest of the book. A reader who is anxious to find himself in medias res may content himself with making some slight acquaintance with the fundamental concepts referred to above. For this purpose, it will be advisable to read Chapters 1–3, and the paragraphs 4.1–4.2, 5.1–5.3, 6.1–6.2, 6.4–6.6, 7.1–7.2, 7.4–7.5 and 8.1–8.4. The reader may then proceed to Chapter 13, and look up the references to the first part as they occur.

The book is founded on my University lectures since about 1930, and has been written mainly during the years 1942–1944.
Owing to war conditions, foreign scientific literature was during these years only very incompletely and with considerable delay available in Sweden, and this must serve as an excuse for the possible absence of quotations which would otherwise have been appropriate.

The printing of the Scandinavian edition of the book has been made possible by grants from the Royal Swedish Academy of Science, and from Stiftelsen Lars Hiertas Minne. I express my gratitude towards these institutions.

My thanks are also due to the Editors of the Princeton Mathematical Series for their kind offer to include the book in the Series, and for their permission to print a separate Scandinavian edition. I am further indebted to Professor R. A. Fisher and to Messrs Oliver and Boyd for permission to reprint tables of the t- and χ²-distributions from »Statistical Methods for Research Workers».

A number of friends have rendered valuable help during the preparation of the book. Professors Harald Bohr and Ernst Jacobsthal, taking refuge in Sweden from the hardships of the times, have read parts of the work in manuscript and in proof, and have given stimulating criticism and advice. Professor Herman Wold has made a very careful scrutiny of the whole work in proof, and I have greatly profited from his valuable remarks. Gösta Almqvist, Jan Jung, Sven G. Lindblom and Bertil Matérn have assisted in the numerical calculations, the revision of the manuscript, and the reading of the proofs. To all these I wish to express my sincere thanks.

Table of Contents

First Part. MATHEMATICAL INTRODUCTION.

Chapters 1–3. Sets of Points.                                    Page
Chapter 1. General properties of sets ............................. 3
  1. Sets. — 2. Subsets, space. — 3. Operations on sets. — 4. Sequences of sets. — 5. Monotone sequences. — 6. Additive classes of sets.
Chapter 2. Linear point sets ..................................... 10
  1. Intervals. — 2. Various properties of sets in R₁. — 3. Borel sets.
Chapter 3. Point sets in n dimensions ............................ 15
  1. Intervals. — 2.
Various properties of sets in Rn. — 3. Borel sets. — 4. Linear sets. — 5. Subspace, product space.
References to chapters 1–3 ....................................... 18

Chapters 4–7. Theory of Measure and Integration in R₁.
Chapter 4. The Lebesgue measure of a linear point set ............ 19
  1. Length of an interval. — 2. Generalization. — 3. The measure of a sum of intervals. — 4. Outer and inner measure of a bounded set. — 5. Measurable sets and Lebesgue measure. — 6. The class of measurable sets. — 7. Measurable sets and Borel sets.
Chapter 5. The Lebesgue integral for functions of one variable ... 33
  1. The integral of a bounded function over a set of finite measure. — 2. B-measurable functions. — 3. Properties of the integral. — 4. The integral of an unbounded function over a set of finite measure. — 5. The integral over a set of infinite measure. — 6. The Lebesgue integral as an additive set function.
Chapter 6. Non-negative additive set functions in R₁ ............. 48
  1. Generalization of the Lebesgue measure and the Lebesgue integral. — 2. Set functions and point functions. — 3. Construction of a set function. — 4. P-measure. — 5. Bounded set functions. — 6. Distributions. — 7. Sequences of distributions. — 8. A convergence theorem.
Chapter 7. The Lebesgue-Stieltjes integral for functions of one variable ... 62
  1. The integral of a bounded function over a set of finite P-measure. — 2. Unbounded functions and sets of infinite P-measure. — 3. Lebesgue-Stieltjes integrals with a parameter. — 4. Lebesgue-Stieltjes integrals with respect to a distribution. — 5. The Riemann-Stieltjes integral.
References to chapters 4–7 ....................................... 75

Chapters 8–9. Theory of Measure and Integration in Rn.
Chapter 8. Lebesgue measure and other additive set functions in Rn ... 76
  1. Lebesgue measure in Rn. — 2. Non-negative additive set functions in Rn. — 3. Bounded set functions. — 4. Distributions. — 5. Sequences of distributions. — 6. Distributions in a product space.
Chapter 9.
The Lebesgue-Stieltjes integral for functions of n variables ..... 85
  1. The Lebesgue-Stieltjes integral. — 2. Lebesgue-Stieltjes integrals with respect to a distribution. — 3. A theorem on repeated integrals. — 4. The Riemann-Stieltjes integral. — 5. The Schwarz inequality.

Chapters 10–12. Various Questions.
Chapter 10. Fourier integrals .................................... 89
  1. The characteristic function of a distribution in R₁. — 2. Some auxiliary functions. — 3. Uniqueness theorem for characteristic functions in R₁. — 4. Continuity theorem for characteristic functions in R₁. — 5. Some particular integrals. — 6. The characteristic function of a distribution in Rn. — 7. Continuity theorem for characteristic functions in Rn.
Chapter 11. Matrices, determinants and quadratic forms .......... 105
  1. Matrices. — 2. Vectors. — 3. Matrix notation for linear transformations. — 4. Matrix notation for bilinear and quadratic forms. — 5. Determinants. — 6. Rank. — 7. Adjugate and reciprocal matrices. — 8. Linear equations. — 9. Orthogonal matrices. Characteristic numbers. — 10. Non-negative quadratic forms. — 11. Decomposition of Σxᵢ². — 12. Some integral formulae.
Chapter 12. Miscellaneous complements ........................... 122
  1. The symbols O, o and ∼. — 2. The Euler-MacLaurin sum formula. — 3. The Gamma function. — 4. The Beta function. — 5. Stirling's formula. — 6. Orthogonal polynomials.

Second Part. RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS.

Chapters 13–14. Foundations.
Chapter 13. Statistics and probability .......................... 137
  1. Random experiments. — 2. Examples. — 3. Statistical regularity. — 4. Object of a mathematical theory. — 5. Mathematical probability.
Chapter 14. Fundamental definitions and axioms .................. 151
  1. Random variables. (Axioms 1–2.) — 2. Combined variables. (Axiom 3.) — 3. Conditional distributions. — 4. Independent variables. — 5. Functions of random variables. — 6. Conclusion.

Chapters 15–20. Variables and Distributions in R₁.
Chapter 15. General properties .................................. 166
  1. Distribution function and frequency function. — 2.
Two simple types of distributions. — 3. Mean values. — 4. Moments. — 5. Measures of location. — 6. Measures of dispersion. — 7. Tchebycheff's theorem. — 8. Measures of skewness and excess. — 9. Characteristic functions. — 10. Semi-invariants. — 11. Independent variables. — 12. Addition of independent variables.
Chapter 16. Various discrete distributions ...................... 192
  1. The function ε(λ). — 2. The binomial distribution. — 3. Bernoulli's theorem. — 4. De Moivre's theorem. — 5. The Poisson distribution. — 6. The generalized binomial distribution of Poisson.
Chapter 17. The normal distribution ............................. 208
  1. The normal functions. — 2. The normal distribution. — 3. Addition of independent normal variables. — 4. The central limit theorem. — 5. Complementary remarks to the central limit theorem. — 6. Orthogonal expansion derived from the normal distribution. — 7. Asymptotic expansion derived from the normal distribution. — 8. The role of the normal distribution in statistics.
Chapter 18. Various distributions related to the normal ......... 233
  1. The χ²-distribution. — 2. Student's distribution. — 3. Fisher's z-distribution. — 4. The Beta distribution.
Chapter 19. Further continuous distributions .................... 244
  1. The rectangular distribution. — 2. Cauchy's and Laplace's distributions. — 3. Truncated distributions. — 4. The Pearson system.
Chapter 20. Some convergence theorems ........................... 250
  1. Convergence of distributions and variables. — 2. Convergence of certain distributions to the normal. — 3. Convergence in probability. — 4. Tchebycheff's theorem. — 5. Khintchine's theorem. — 6. A convergence theorem.
Exercises to chapters 15–20 ..................................... 255

Chapters 21–24. Variables and Distributions in Rn.
Chapter 21. The two-dimensional case ............................ 260
  1. Two simple types of distributions. — 2. Mean values, moments. — 3. Characteristic functions. — 4. Conditional distributions. — 5. Regression, I. — 6. Regression, II. — 7. The correlation coefficient. — 8. Linear transformation of variables. — 9.
The correlation ratio and the mean square contingency. — 10. The ellipse of concentration. — 11. Addition of independent variables. — 12. The normal distribution.
Chapter 22. General properties of distributions in Rn .......... 291
  1. Two simple types of distributions. Conditional distributions. — 2. Change of variables in a continuous distribution. — 3. Mean values, moments. — 4. Characteristic functions. — 5. Rank of a distribution. — 6. Linear transformation of variables. — 7. The ellipsoid of concentration.
Chapter 23. Regression and correlation in n variables .......... 301
  1. Regression surfaces. — 2. Linear mean square regression. — 3. Residuals. — 4. Partial correlation. — 5. The multiple correlation coefficient. — 6. Orthogonal mean square regression.
Chapter 24. The normal distribution ............................. 310
  1. The characteristic function. — 2. The non-singular normal distribution. — 3. The singular normal distribution. — 4. Linear transformation of normally distributed variables. — 5. Distribution of a sum of squares. — 6. Conditional distributions. — 7. Addition of independent variables. The central limit theorem.
Exercises to chapters 21–24 ..................................... 317

Third Part. STATISTICAL INFERENCE.

Chapters 25–26. Generalities.
Chapter 25. Preliminary notions on sampling ..................... 323
  1. Introductory remarks. — 2. Simple random sampling. — 3. The distribution of the sample. — 4. The sample values as random variables. Sampling distributions. — 5. Statistical image of a distribution. — 6. Biased sampling. Random sampling numbers. — 7. Sampling without replacement. The representative method.
Chapter 26. Statistical inference ............................... 332
  1. Introductory remarks. — 2. Agreement between theory and facts. Tests of significance. — 3. Description. — 4. Analysis. — 5. Prediction.

Chapters 27–29. Sampling Distributions.
Chapter 27. Characteristics of sampling distributions ........... 341
  1. Notations. — 2. The sample mean x̄. — 3. The moments aᵥ. — 4. The variance m₂. — 5. Higher central moments and semi-invariants. — 6.
Unbiased estimates. — 7. Functions of moments. — 8. Characteristics of multi-dimensional distributions. — 9. Corrections for grouping.
Chapter 28. Asymptotic properties of sampling distributions ..... 363
  1. Introductory remarks. — 2. The moments. — 3. The central moments. — 4. Functions of moments. — 5. The quantiles. — 6. The extreme values and the range.
Chapter 29. Exact sampling distributions ........................ 378
  1. The problem. — 2. Fisher's lemma. Degrees of freedom. — 3. The joint distribution of x̄ and s² in samples from a normal distribution. — 4. Student's ratio. — 5. A lemma. — 6. Sampling from a two-dimensional normal distribution. — 7. The correlation coefficient. — 8. The regression coefficients. — 9. Sampling from a k-dimensional normal distribution. — 10. The generalized variance. — 11. The generalized Student ratio. — 12. Regression coefficients. — 13. Partial and multiple correlation coefficients.

Chapters 30–31. Tests of Significance, I.
Chapter 30. Tests of goodness of fit and allied tests ........... 416
  1. The χ² test in the case of a completely specified hypothetical distribution. — 2. Examples. — 3. The χ² test when certain parameters are estimated from the sample. — 4. Examples. — 5. Contingency tables. — 6. χ² as a test of homogeneity. — 7. Criterion of differential death rates. — 8. Further tests of goodness of fit.
Chapter 31. Tests of significance for parameters ................ 452
  1. Tests based on standard errors. — 2. Tests based on exact distributions. — 3. Examples.

Chapters 32–34. Theory of Estimation.
Chapter 32. Classification of estimates ......................... 473
  1. The problem. — 2. Two lemmas. — 3. Minimum variance of an estimate. Efficient estimates. — 4. Sufficient estimates. — 5. Asymptotically efficient estimates. — 6. The case of two unknown parameters. — 7. Several unknown parameters. — 8. Generalization.
Chapter 33. Methods of estimation ............................... 497
  1. The method of moments. — 2. The method of maximum likelihood. — 3. Asymptotic properties of maximum likelihood estimates. — 4.
The χ² minimum method.
Chapter 34. Confidence regions .................................. 507
  1. Introductory remarks. — 2. A single unknown parameter. — 3. The general case. — 4. Examples.

Chapters 35–37. Tests of Significance, II.
Chapter 35. General theory of testing statistical hypotheses .... 525
  1. The choice of a test of significance. — 2. Simple and composite hypotheses. — 3. Tests of simple hypotheses. Most powerful tests. — 4. Unbiased tests. — 5. Tests of composite hypotheses.
Chapter 36. Analysis of variance ................................ 536
  1. Variability of mean values. — 2. Simple grouping of variables. — 3. Generalization. — 4. Randomized blocks. — 5. Latin squares.
Chapter 37. Some regression problems ............................ 548
  1. Problems involving non-random variables. — 2. Simple regression. — 3. Multiple regression. — 4. Further regression problems.

Tables 1–2. The Normal Distribution ............................. 557
Table 3. The χ²-Distribution .................................... 559
Table 4. The t-Distribution ..................................... 560
List of References .............................................. 561
Index ........................................................... 571

FIRST PART

MATHEMATICAL INTRODUCTION

Chapters 1–3. Sets of Points.

CHAPTER 1.

General Properties of Sets.

1.1. Sets. — In pure and applied mathematics, situations often occur where we have to consider the collection of all possible objects having certain specified properties. Any collection of objects defined in this way will be called a set, and each object belonging to such a set will be called an element of the set.

The elements of a set may be objects of any kind: points, numbers, functions, things, persons etc. Thus we may consider e.g. 1) the set of all positive integral numbers, 2) the set of all points on a given straight line, 3) the set of all rational functions of two variables, 4) the set of all persons born in a given country and alive at the end of the year 1940.

In the first part of this book we shall mainly deal with cases where the elements are points or numbers, but in this introductory chapter we shall give some considerations which apply to the general case when the elements may be of any kind.
In the example 4) given above, our set contains a finite, though possibly unknown, number of elements, whereas in the three first examples we obviously have to do with sets where the number of elements is not finite. We thus have to distinguish between finite and infinite sets.

An infinite set is called enumerable if its elements may be arranged in a sequence: x₁, x₂, …, xₙ, …, such that a) every xₙ is an element of the set, and b) every element of the set appears at a definite place in the sequence. By such an arrangement we establish a one-to-one correspondence between the elements of the given set and those of the set containing all positive integral numbers 1, 2, …, n, …, which forms the simplest example of an enumerable set.

We shall see later that there exist also infinite sets which are non-enumerable. If, from such a set, we choose any sequence of elements x₁, x₂, …, there will always be elements left in the set which do not appear in the sequence, so that a non-enumerable set may be said to represent a higher order of infinity than an enumerable set. It will be shown later (cf 4.3) that the set of all points on a given straight line affords an example of a non-enumerable set.

1.2. Subsets, space. — If two sets S and S₁ are such that every element of S₁ also belongs to S, we shall call S₁ a subset of S, and write S₁ < S or S > S₁. We shall sometimes express this also by saying that S₁ is contained in S or belongs to S. — When S₁ consists of one single element x, we use the same notation x < S to express that x belongs to S.

In the particular case when both the relations S₁ < S and S < S₁ hold, the sets are called equal, and we write S = S₁.

It is sometimes convenient to consider a set S which does not contain any element at all. This we call the empty set, and write S = 0. The empty set is a subset of any set.
If we regard the empty set as a particular case of a finite set, it is seen that every subset of a finite set is itself finite, while every subset of an enumerable set is finite or enumerable. Thus the set of all integers between 20 and 30 is a finite subset of the set 1, 2, 3, …, while the set of all odd integers 1, 3, 5, … is an enumerable subset of the same set.

In many investigations we shall be concerned with the properties and the mutual relations of various subsets of a given set S. The set S, which thus contains the totality of all elements that may appear in the investigation, will then be called the space of the investigation. If, e.g., we consider various sets of points on a given straight line, we may choose as our space the set S of all points on the line. Any subset S of the space will be called briefly a set in the space.

1.3. Operations on sets. — Suppose now that a space **S** is given, and let us consider various sets in **S**. We shall first define the operations of addition, multiplication and subtraction for sets.

The sum of two sets S₁ and S₂, S = S₁ + S₂, is the set S of all elements belonging to at least one of the sets S₁ and S₂. — The product S′ = S₁S₂ is the common part of the sets, or the set S′ of all elements belonging to both S₁ and S₂. — Finally, the difference S″ = S₁ − S₂ will be defined only in the case when S₂ is a subset of S₁, and is then the set S″ of all elements belonging to S₁ but not to S₂.

Thus if S₁ and S₂ consist of all points inside the curves C₁ and C₂ respectively (cf Fig. 1), S₁ + S₂ will be the set of all points inside at least one of the two curves, while S₁S₂ will be the set of all points common to both domains. The product S₁S₂ is evidently a subset of both S₁ and S₂. The difference Sₙ − S₁S₂, where n may denote 1 or 2, is the set of all points of Sₙ which do not belong to S₁S₂.

In the particular case when S₁ and S₂ have no common elements, the product is empty, so that we have S₁S₂ = 0.
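These three operations correspond directly to union, intersection and set difference on finite sets. A small Python sketch (not from the book; the sets of integers standing in for point sets are our own illustration) shows the sum, product and difference, and the way the text forms Sₙ − S₁S₂ so that the subtrahend is always a subset of the minuend:

```python
# Illustration (not from the book): the sum, product and difference of
# section 1.3, using finite Python sets of integers in place of point sets.
S1 = {1, 2, 3, 4}
S2 = {3, 4, 5, 6}

total = S1 | S2        # the sum S1 + S2: elements in at least one set
product = S1 & S2      # the product S1*S2: elements common to both

# The difference S1 - S2 is defined in the text only when S2 < S1, so,
# as in the expression Sn - S1S2, we subtract the common part instead.
diff = S1 - product    # points of S1 not belonging to S1S2

assert total == {1, 2, 3, 4, 5, 6}
assert product == {3, 4}
assert diff == {1, 2}
assert product <= S1 and product <= S2   # S1S2 is a subset of both
```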
On the other hand, if S₁ = S₂ the difference is empty, and we have S₁ − S₂ = 0. In the particular case when S₂ is a subset of S₁ we have S₁ + S₂ = S₁ and S₁S₂ = S₂.

It follows from the symmetrical character of our definitions of the sum and the product that the operations of addition and multiplication are commutative, i.e. that we have S₁ + S₂ = S₂ + S₁ and S₁S₂ = S₂S₁. Further, a moment's reflection will show that these operations are also associative and distributive, like the corresponding arithmetic operations. We thus have

(S₁ + S₂) + S₃ = S₁ + (S₂ + S₃),
(S₁S₂)S₃ = S₁(S₂S₃),
S₁(S₂ + S₃) = S₁S₂ + S₁S₃.

It follows that we may without ambiguity talk of the sum or product of any finite number of sets: S₁ + S₂ + ⋯ + Sₙ and S₁S₂ ⋯ Sₙ, where the order of terms and factors is arbitrary.

We may even extend the definition of these two operations to an enumerable sequence of terms or factors. Thus, given a sequence S₁, S₂, … of sets in **S**, we define the sum

∑₁^∞ Sᵥ = S₁ + S₂ + ⋯

as the set of all elements belonging to at least one of the sets Sᵥ, while the product

∏₁^∞ Sᵥ = S₁S₂ ⋯

is the set of all elements belonging to all Sᵥ. — We then have, e.g., S(S₁ + S₂ + ⋯) = SS₁ + SS₂ + ⋯.

Thus if Sᵥ denotes the set of all real numbers x such that 1/(ν + 1) < x ≤ 1/ν, we find that ∑₁^∞ Sᵥ will be the set of all x such that 0 < x ≤ 1, while the product set will be empty, ∏₁^∞ Sᵥ = 0. — On the other hand, if Sᵥ denotes the set of all x such that 0 ≤ x ≤ 1/ν, the sum ∑₁^∞ Sᵥ will coincide with S₁, while the product ∏₁^∞ Sᵥ will be a set containing one single element, viz. the number x = 0.

For the operation of subtraction, an important particular case arises when S₁ coincides with the whole space **S**. The difference

S* = **S** − S

is the set of all elements of our space which do not belong to S, and will be called the complementary set or simply the complement of S. We obviously have S + S* = **S**, SS* = 0, and (S*)* = S.
It is important to observe that the complement of a given set S is relative to the space **S** in which S is considered. If our space is the set of all points on a given straight line L, and if S is the set of all points situated on the positive side of an origin O on this line, the complement S* will consist of O itself and all points on the negative side of O. If, on the other hand, our space consists of all points in a certain plane P containing L, the complement S* of the same set S will also include all points of P not belonging to L. — In all cases where there might be a risk of a mistake, we shall use the expression: S* is the complement of S with respect to **S**.

The operations of addition and multiplication may be brought into relation with one another by means of the concept of complementary sets. We have, in fact, for any finite or enumerable sequence S₁, S₂, … the relations

(1.3.1)  (S₁ + S₂ + ⋯)* = S₁*S₂* ⋯,  (S₁S₂ ⋯)* = S₁* + S₂* + ⋯.

The first relation expresses that the complementary set of a sum is the product of the complements of the terms. This is a direct consequence of the definitions. As a matter of fact, the complement (S₁ + S₂ + ⋯)* is the set of all elements x of the space, of which it is not true that they occur in at least one Sᵥ. This is, however, the same thing as the set of all elements x which are absent from every Sᵥ, or the set of all x which belong to every complement Sᵥ*, i.e. the product S₁*S₂* ⋯. The second relation is obtained from the first by substituting Sᵥ* for Sᵥ. — For the operation of subtraction, we obtain by a similar argument the relation

(1.3.2)  (S₁ − S₂)* = S₁* + S₂.

The reader will find that the understanding of relations such as (1.3.1) and (1.3.2) is materially simplified by the use of figures of the same type as Fig. 1.

1.4. Sequences of sets. — When we use the word sequence without further specification, it will be understood that we mean a finite or
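Relations (1.3.1) and (1.3.2) are the familiar de Morgan laws, and they are easy to check mechanically on a small finite space. The following Python sketch (our own illustration, not from the book; the space and the particular sets are arbitrary choices) takes complements relative to a fixed space and verifies both relations:

```python
# Illustration (not from the book): checking relations (1.3.1) and (1.3.2)
# on a small finite space, the complement being taken relative to SPACE.
SPACE = set(range(10))
sets = [{0, 1, 2}, {2, 3, 4}, {4, 5}]

def comp(s):
    """Complement of s with respect to the chosen space."""
    return SPACE - s

union = set().union(*sets)            # S1 + S2 + ...
inter = SPACE.intersection(*sets)     # S1 S2 ...

# (1.3.1): (S1 + S2 + ...)* = S1* S2* ...
assert comp(union) == SPACE.intersection(*(comp(s) for s in sets))
# (1.3.1): (S1 S2 ...)* = S1* + S2* + ...
assert comp(inter) == set().union(*(comp(s) for s in sets))

# (1.3.2): for S2 < S1, (S1 - S2)* = S1* + S2
S1, S2 = {0, 1, 2, 3}, {1, 2}
assert comp(S1 - S2) == comp(S1) | S2
```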
When we are concerned with the sum of a sequence of sets + S, -f it is sometimes useful to be able to represent S as the sum of a sequence of sets such that no two have a common element. This may be effected by the following transformation. Let us put • s:-iSr, Thus Zv is the set of all elements of 5,- not contained in any of the preceding sets Sv-i, It is then easily seen that and Zr have no common element, as soon as Suppose e. g. fx<v\ then Zfi is a subset of while Z^ is a subset of S*, so that Zf, Z, = 0 . Let us now put S = Z^ 4- + . Since Zv <. *SV for all r, we have S* < S. On the other hand, let x denote any element of S. By definition, x belongs to at least one of the Sr. Let Sn be the first set of the sequence S^, . . . that contains x as an element. Then the definition of Zn shows that x belongs to Zn and consequently also to S\ Thus we have both S cr and S' c: S, so that S' — S and s ^ Z^ + Z^ + ' • ' . We shall use this transformation to show that the sum of a seque^ice of enumerable sets is itself enumerable. If Sy is enumerable, then Zr as a subset of Sr mast be finite or enumerable. Let the elements of Zv be XruXri, .... Then the elements of S 2 Sy = I Zr form the double sequence ^ii ^12 ^21 *^*22 ^23 ^81 ^82 ■'^33 8 1.4-5 and these may be arranged in a simple sequence e. g. by reading along diagonals: a;,^, ccj,, .... It is readily seen that every element of S appears at a definite place in the sequence, and thus S is enumerable. 1.5. Monotone sequences. — A sequence Si,S 2 , . . . is never de- creasing, if we have Sn < Sn+i for all n. If, on the contrary, we have Sn > 5n+i for all w, the sequence is never increasing. With a common name, both types of sequences are called monotone. For a never decreasing infinite sequence, we have n 5 „ = 1 and this makes it natural to define the limit of such a sequence by writing 00 lim Sn = 2 00 Similarly, we have for a never increasing sequence I and accordingly we define in this case 00 lim <5^11 = JJ Sv. 
Thus if Sₙ denotes the set of all points (x, y, z) inside the sphere x² + y² + z² = 1 − 1/n, the sequence S₁, S₂, … will be never decreasing, and lim Sₙ will be the set of all points inside the sphere x² + y² + z² = 1. On the other hand, if Sₙ denotes the set of all points inside the sphere x² + y² + z² = 1 + 1/n, the sequence will be never increasing, and lim Sₙ will consist of all points belonging to the inside or the surface of the sphere x² + y² + z² = 1.

It is possible to extend the definition of a limit also to certain types of sequences that are not monotone. We shall, however, have no occasion to use such an extension in this book.
The class of all possible subsets of S is an obvious example of an additive class. In the following chapter we shall, however, meet with a more interesting case. CHAPTER 2. Linear Point Sets, 2.1. Intervals. — Let our space be the set jR, of all points on a given straight line. Any set in will be called a linear point set *) In this book, we shall always use the word ^additive- in the same sense as in this paragraph, i.e. with reference to a finite or enumerable sequence of terms. It may be remarked that some authors use in this sense the expression ^completely additive*, while ^additive* or ^simply additive is used to denote a property essent- iftlly restricted to a finite number of terms. 10 2.1 If we choose on our line an origin 0, a unit of measurement and a positire direction, it is well known that we can establish a one-to-one correspondence between all real numbers and all points on the line. Thus we may talk without distinction of a point x on the line or the real number x that corresponds to the point. We consider only points corresponding to finite numbers; thus infinity does not count as a point. A simple case of a linear point set is an interval. If a and h are any points such that we shall use the following expressions to denote the set of all x such that: a ^ X ^ by . . . the closed interval (u, b)\ a < X < by . . . the open interval (a, b)\ a < X ^ by . , . the half-open interval (a, 6), closed on the right; a ^ X < by . . . the half-open interval (ay 6 ), closed on the left. When we talk simply of an interval (a, 6) without further specification in the context, it will be understood that anything that we say shall be true for all four kinds of intervals. In the limiting case when a = by we shall say that the interval is degenerate. In this case, the closed interval reduces to a set con- taining the single point x = ay while each of the other three intervals is empty. 
If, in the above inequalities, we allow b to tend to H- oo, we obtain the inequalities defining the closed and the open infinite inter- val (a, -h Qo) respectively: x^ a and x> a. Similarly when a tends to — oo we obtain X ^b and x < b for the closed and the open infinite interval ( — — Finally, the whole space may be considered as the infinite interval (— co^oo). It will be shown below (cf 4.3) that any non-degenerate interval is a non-enumerable set. The product of a finite or enumerable sequence of intervals is always an interval, but the sum of two intervals is generally not an interval. In order to give an example of a case when a sum of intervals 11 2 . 1-2 is another interval, we consider n •¥ \ points a < • * • < Xn~\ < b. If all intervals appearing in the following relation are half open and closed on the same side, we obviously have (a, h) = (a, Xi) -f (xj, a:,) + h (xn-i, 6), and no two terms in the second member have a common point. The same relation holds if all intervals are closed, but in this case any two consecutive terms have precisely one common point. If all inter- vals are open, on the other hand, the relation is not true. 2.2. Various properties of sets In Hi. — Consider a non-empty set S. When a point a exists such that, for any £ > 0, there is at least one point of S in the closed interval (or, a + t), while there is none in the open interval ( — oo, a), we shall call or the lower bound of S, When no finite a with this property exists, we shall say that the lower bound of S is — Qo . In a similar way we define the upper hound of S. A set is hounded y when its lower and upper bounds are both finite. A bounded set S is a subset of the closed interval (o, /9). The points a and ^ themselves may or may not belong to S. If £ is any positive number, the open interval {x — £, x e) will be called a neighbourhood of the point x or, more precisely, the e-neigh' bourhood of x. 
A point z is called a limiting point of the set S if every neighbourhood of z contains at least one point of S different from z. If this condition is satisfied, it is readily seen that every neighbourhood of z even contains an infinity of points of S. The point z itself may or may not belong to S. The Bolzano-Weierstrass theorem asserts that every bounded infinite set has at least one limiting point. We assume this to be already known. — If z is a limiting point, the set S always contains a sequence of points x₁, x₂, . . . such that xₙ → z as n → ∞.

A point x of S is called an inner point of S if we can find ε such that the whole ε-neighbourhood of x is contained in S. Obviously an inner point is always a limiting point.

We shall now give some examples of the concepts introduced above. — In the first place, let S be a finite non-degenerate interval (a, b). Then a is the lower bound and b is the upper bound of S. Every point belonging to the closed interval (a, b) is a limiting point of S, while every point belonging to the open interval (a, b) is an inner point of S.

Consider now the set R of all rational points x = p/q belonging to the half-open interval 0 < x ≤ 1. If we write the sequence

1/1; 1/2, 2/2; 1/3, 2/3, 3/3; 1/4, 2/4, 3/4, 4/4; . . .

and then discard all numbers p/q such that p and q have a common factor, every point of R will occur at precisely one place in the sequence, and hence R is enumerable. There are no inner points of R. Every point of the closed interval (0, 1) is a limiting point. — The complement R* of R with respect to the half-open interval 0 < x ≤ 1 is the set of all irrational points contained in that interval. R* is not an enumerable set, as in that case the interval (0, 1) would be the sum of two enumerable sets and thus itself enumerable. Like R itself, R* has no inner points, and every point of the closed interval (0, 1) is a limiting point.
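The enumeration of R just described can be carried out mechanically. The following sketch (illustrative only; the function name is ours) generates the sequence 1/1; 1/2, 2/2; 1/3, 2/3, 3/3; . . . and discards every fraction whose numerator and denominator have a common factor, so that each rational of 0 < x ≤ 1 appears exactly once.

```python
from fractions import Fraction
from math import gcd

def rationals_in_unit_interval(n_terms):
    """Enumerate the rationals p/q with 0 < p/q <= 1 in the order
    1/1; 1/2, 2/2; 1/3, 2/3, 3/3; ..., skipping every p/q in which p and q
    have a common factor, so each rational occurs at precisely one place."""
    out = []
    q = 1
    while len(out) < n_terms:
        for p in range(1, q + 1):
            if gcd(p, q) == 1:  # keep only fractions in lowest terms
                out.append(Fraction(p, q))
                if len(out) == n_terms:
                    break
        q += 1
    return out

first_five = rationals_in_unit_interval(5)  # 1, 1/2, 1/3, 2/3, 1/4
```

That this procedure never terminates, yet reaches every rational of the interval at a finite position, is precisely what the enumerability of R asserts.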
Since R is enumerable, it immediately follows that the set Rₙ of all rational points x belonging to the half-open interval n < x ≤ n + 1 is, for every positive or negative integer n, an enumerable set. From a proposition proved in 1.4 it then follows that the set of all positive and negative rational numbers is enumerable. The latter set is, in fact, the sum of the sequence {Rₙ}, where n assumes all positive and negative integral values, and is thus by 1.4 an enumerable set.

2.3. Borel sets. — Consider the class of all intervals in R₁ — closed, open and half-open, degenerate and non-degenerate, finite and infinite, including in particular the whole space R₁ itself. Obviously this is not an additive class of sets as defined in 1.6, since the sum of two intervals is generally not an interval. Let us try to build up an additive class by associating further sets to the intervals.

As a first generalization we consider the class ℑ of all point sets I such that I is the sum of a finite or enumerable sequence of intervals. If I₁, I₂, . . . are sets belonging to the class ℑ, the sum I₁ + I₂ + ⋯ is, by 1.4, also the sum of a finite or enumerable sequence of intervals, and thus belongs to ℑ. The same thing holds for any finite product I₁ I₂ . . . Iₙ, on account of the extension of the distributive property indicated in 1.3. We shall, however, show by examples that neither the infinite product I₁ I₂ . . . nor the difference I₁ − I₂ necessarily belongs to ℑ.

In fact, the set R considered in the preceding paragraph belongs to ℑ, since it is the sum of an enumerable sequence of degenerate intervals, each containing one single point p/q. The difference (0, 1) − R, on the other hand, does not contain any non-degenerate interval, and if we try to represent it as a sum of degenerate intervals, a non-enumerable set of such intervals will be required. Thus the difference does not belong to the class ℑ. Further, this difference set may also be represented as a product I₁ I₂ . . .,
where Iₙ denotes the difference between the interval (0, 1) and the set containing only the n:th point of the set R. Thus this product of sets in ℑ does not itself belong to the class ℑ.

Though we shall make in Ch. 4 an important use of the class ℑ, it is thus clear that for our present purpose this class is not sufficient. In order to build up an additive class, we must associate with ℑ further sets of a more general character. If we associate with ℑ all sums and products of sequences of sets in ℑ, and all differences between two sets in ℑ such that the difference is defined — some of which sets are, of course, already included in ℑ — we obtain an extended class of sets. It can, however, be shown that not even this extended class will satisfy all the conditions for an additive class. We thus have to repeat the same process of association over and over again, without ever coming to an end. Any particular set reached during this process has the property that it can be defined by starting from intervals and performing the operations of addition, multiplication and subtraction a finite or enumerable number of times. The totality of all sets ever reached in this way is called the class 𝔅₁ of Borel sets in R₁, and this is an additive class. As a matter of fact, every given Borel set can be formed as described by at most an enumerable number of steps, and any sum, product or difference formed with such sets will still be contained in the class of all sets obtainable in this way. Thus any sum, product or difference of Borel sets is itself a Borel set. In particular, the limit of a monotone sequence (cf 1.5) of Borel sets is always a Borel set.

On the other hand, let C be any additive class of sets in R₁ containing all intervals. It then follows directly from the definition of an additive class that C must contain every set that can be obtained from intervals by any finite or enumerable repetition of the operations of addition, multiplication and subtraction.
Thus C must contain the whole class of Borel sets, and we may say that the class 𝔅₁ is the smallest additive class of sets in R₁ that includes all intervals.

CHAPTER 3.

Point Sets in n Dimensions.

3.1. Intervals. — Just as we may establish a one-to-one correspondence between all real numbers x and all points on a straight line, it is well known that a similar correspondence may be established between all pairs of real numbers (x₁, x₂) and all points in a plane, or between all triplets of real numbers (x₁, x₂, x₃) and all points in three-dimensional space. Generalizing, we may regard any system of n real numbers (x₁, x₂, . . ., xₙ) as representing a point or vector x in a euclidean space Rₙ of n dimensions. The numbers x₁, . . ., xₙ are called the coordinates of x. As in the one-dimensional case, we consider only points corresponding to finite values of the coordinates. — The distance between two points x = (x₁, . . ., xₙ) and y = (y₁, . . ., yₙ) is the non-negative quantity

|x − y| = √((x₁ − y₁)² + ⋯ + (xₙ − yₙ)²).

The distance satisfies the triangular inequality:

|x − y| ≤ |x − z| + |y − z|.

Let 2n numbers a₁, . . ., aₙ and b₁, . . ., bₙ be given, such that aᵥ ≤ bᵥ for ν = 1, . . ., n. The set of all points x defined by aᵥ ≤ xᵥ ≤ bᵥ for ν = 1, . . ., n is called a closed n-dimensional interval. If all the signs ≤ are replaced by <, we obtain an open interval, and if both kinds of signs occur in the defining inequalities, we have a half-open interval. In the limiting case when aᵥ = bᵥ for at least one value of ν, the interval is degenerate. When one or more of the aᵥ tend to −∞, or one or more of the bᵥ to +∞, we obtain an infinite interval. As in 2.1, the whole space Rₙ may be considered as an extreme case of an infinite interval.

It will be shown below (cf 4.3) that any non-degenerate interval is a non-enumerable set. The product of a finite or enumerable sequence of intervals is always an interval, but the sum of two intervals is generally not an interval.
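The distance formula and the triangular inequality are easy to check numerically. A minimal sketch (illustrative only; the sample points are hypothetical):

```python
import math

def distance(x, y):
    """Euclidean distance |x - y| between two points of R_n,
    given as equal-length coordinate tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical sample points in R_3.
x = (0.0, 0.0, 0.0)
y = (3.0, 4.0, 0.0)
z = (1.0, 1.0, 1.0)

# Triangular inequality: |x - y| <= |x - z| + |y - z|.
holds = distance(x, y) <= distance(x, z) + distance(y, z)
```

For the points chosen, |x − y| = √(3² + 4²) = 5, and the inequality of course holds; no finite experiment proves it in general, but the sketch shows exactly what the formula asserts.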
3.2. Various properties of sets in Rₙ. — A set S in Rₙ is bounded, if all points of S are contained in a finite interval.

If a = (a₁, . . ., aₙ) is a given point, and ε is a positive number, the set of all points x such that |x − a| < ε is called a neighbourhood of a or, more precisely, the ε-neighbourhood of a. The definitions of the concepts of limiting point and inner point, and the remarks made in 2.2 in connection with these concepts for the case n = 1, apply without modification to the general case here considered.

We have seen in 2.2 that the set of all rational points in R₁ is enumerable. By means of 1.4 it then follows that the set of all points with rational coordinates in a plane is enumerable, and further by induction that the set of all points in Rₙ with rational coordinates is enumerable.

3.3. Borel sets. — The class of all intervals in Rₙ is, like the corresponding class in R₁, not an additive class of sets. In order to extend this class so as to form an additive class we proceed in the same way as in the case of intervals in R₁.

Thus we consider first the class ℑₙ of all sets I that are sums of finite or enumerable sequences of intervals in Rₙ. If I₁, I₂, . . . are sets belonging to this class, the sum I₁ + I₂ + ⋯ and the finite product I₁ I₂ . . . Iₙ also belong to ℑₙ. As in the case n = 1, the infinite product I₁ I₂ . . . and the difference I₁ − I₂ do not, however, always belong to ℑₙ.

We thus extend the class ℑₙ by associating all sums, products and differences formed by means of sets in ℑₙ. Repeating the same association process over and over again, we find that any particular set reached in this way has the property that it can be defined by starting from intervals and performing the operations of addition, multiplication and subtraction a finite or enumerable number of times. The totality of all sets ever reached in this way is called the class 𝔅ₙ of Borel sets in Rₙ, and this is an additive class.
In the same way as in the case n = 1, we find that the class 𝔅ₙ is the smallest additive class of sets in Rₙ that includes all intervals.

3.4. Linear sets. — When n ≥ 3, the set of all points in Rₙ which satisfy a single equation F(x₁, . . ., xₙ) = 0 will be called a hypersurface. When F is a linear function, the hypersurface becomes a hyperplane. The equation of a hyperplane may always be written in the form

a₁(x₁ − m₁) + ⋯ + aₙ(xₙ − mₙ) = 0,

where m = (m₁, . . ., mₙ) is an arbitrary point of the hyperplane. — Let

(3.4.1)  Hᵢ = aᵢ₁(x₁ − m₁) + ⋯ + aᵢₙ(xₙ − mₙ) = 0,

where i = 1, 2, . . ., p, be the equations of p hyperplanes passing through the same point m. The equations (3.4.1) will be called linearly independent, if there is no linear combination k₁H₁ + ⋯ + kₚHₚ with constant kᵢ not all = 0, which reduces identically to zero. The corresponding hyperplanes are then also said to be linearly independent.

Suppose p < n, and consider the set L of all points in Rₙ common to the p linearly independent hyperplanes (3.4.1). If (3.4.1) is considered as a system of linear equations with the unknowns x₁, . . ., xₙ, the general solution (cf 11.8) is

xᵢ = mᵢ + cᵢ₁ t₁ + ⋯ + cᵢ,ₙ₋ₚ tₙ₋ₚ,

where the cᵢₖ are constants depending on the coefficients aᵢₖ, while t₁, . . ., tₙ₋ₚ are arbitrary parameters. The coordinates of a point of the set L may thus be expressed as linear functions of n − p arbitrary parameters. Accordingly the set L will be called a linear set of n − p dimensions, and will usually be denoted by Lₙ₋ₚ. For p = 1, this is a hyperplane, while for p = n − 2 L forms an ordinary plane, and for p = n − 1 a straight line. — Conversely, if Lₙ₋ₚ is a linear set of n − p dimensions, and if m = (m₁, . . ., mₙ) is an arbitrary point of Lₙ₋ₚ, then Lₙ₋ₚ may be represented as the common part (i.e. the product set) of p linearly independent hyperplanes passing through m.

3.5. Subspace, product space. — Consider the space Rₙ of all points x = (x₁, . . ., xₙ).
Let us select a group of k < n coordinates, say x₁, . . ., xₖ, and put all the remaining n − k coordinates equal to zero: xₖ₊₁ = ⋯ = xₙ = 0. We thus obtain a system of n − k linearly independent relations, which define a linear set Lₖ of k dimensions. This will be called the k-dimensional subspace corresponding to the coordinates x₁, . . ., xₖ. The subspace corresponding to any other group of k coordinates is, of course, defined in a similar way. Thus in the case n = 3, k = 2, the two-dimensional subspace corresponding to x₁ and x₂ is simply the (x₁, x₂)-plane.

Let S denote a set in the k-dimensional subspace of x₁, . . ., xₖ. The set of all points x in Rₙ such that (x₁, . . ., xₖ, 0, . . ., 0) ⊂ S will be called a cylinder set with the base S. — In the case n = 3, k = 2, this is an ordinary three-dimensional cylinder in the (x₁, x₂, x₃)-space, having the set S in the (x₁, x₂)-plane as its base.

Further, if S₁ and S₂ are sets in the subspaces of x₁, . . ., xₖ and xₖ₊₁, . . ., xₙ respectively, the set of all points x in Rₙ such that (x₁, . . ., xₖ, 0, . . ., 0) ⊂ S₁ and (0, . . ., 0, xₖ₊₁, . . ., xₙ) ⊂ S₂ will be called a rectangle set with the sides S₁ and S₂. — In the case when n = 2, while S₁ and S₂ are one-dimensional intervals, this is an ordinary rectangle in the (x₁, x₂)-plane.

Finally, let Rₘ and Rₙ be spaces of m and n dimensions respectively. Consider the set of all pairs of points (x, y), where x = (x₁, . . ., xₘ) is a point in Rₘ, while y = (y₁, . . ., yₙ) is a point in Rₙ. This set will be called the product space of Rₘ and Rₙ. It is a space of m + n dimensions, with all points (x₁, . . ., xₘ, y₁, . . ., yₙ) as its elements. — Thus for m = n = 1, we find that the (x₁, x₂)-plane may be regarded as the product of the one-dimensional x₁- and x₂-spaces. For m = 2 and n = 1, we obtain the (x₁, x₂, x₃)-space as the product of the (x₁, x₂)-plane and the one-dimensional x₃-space, etc. The extension of the above definition to product spaces of more than two spaces is obvious.
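The formation of the product space amounts to nothing more than concatenating coordinate systems, as the following one-function sketch makes explicit (purely illustrative, not part of the text; the function name is ours):

```python
def product_point(x, y):
    """Form the point (x1, ..., xm, y1, ..., yn) of the product space
    R_{m+n} from a point x of R_m and a point y of R_n."""
    return tuple(x) + tuple(y)

# The (x1, x2, x3)-space as the product of the (x1, x2)-plane
# and the one-dimensional x3-space:
p = product_point((1.0, 2.0), (3.0,))
```

The dimension of the resulting point is m + n, the sum of the dimensions of the two factors, exactly as stated above.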
(Note that the product space introduced here is something quite different from the product set defined in 1.3.)

References to chapters 1–3. — The theory of sets of points was founded by G. Cantor about 1880. It is of a fundamental importance for many branches of mathematics, such as the modern theory of integration and the theory of functions. Most treatises on these subjects contain chapters on sets of points. The reader may be referred e.g. to the books by Borel (Ref. 8) and de la Vallée Poussin (Ref. 40).

Chapters 4–7. Theory of Measure and Integration in R₁.

CHAPTER 4.

The Lebesgue Measure of a Linear Point Set.

4.1. Length of an interval. — The length of a finite interval (a, b) in R₁ is the non-negative quantity b − a. Thus the length has the same value for a closed, an open and a half-open interval with the same end-points. For a degenerate interval, the length is zero. The length of an infinite interval we define as +∞.

Thus with every interval i = (a, b) we associate a definite non-negative length, which may be finite or infinite. We may express this by saying that the length L(i) is a non-negative function of the interval i, and writing L(i) = b − a, or L(i) = +∞, according as the interval i is finite or infinite.

If an interval i is the sum (cf 2.1) of a finite number of intervals, no two of which have a common point:

i = i₁ + i₂ + ⋯ + iₙ  (iμ iᵥ = 0 for μ ≠ ν),

the length of the total interval i is obviously equal to the sum of the lengths of the parts:

L(i) = L(i₁) + L(i₂) + ⋯ + L(iₙ).

We now propose to show that this relation may be extended to an enumerable sequence of parts. To a reader who studies the subject for the first time, this will no doubt seem trivial. A careful study of the following proof may perhaps convince him that it is not.
— In order to give a rigorous proof of our statement, we shall require the following important proposition known as Borel's lemma:

We are given a finite closed interval (a, b) and a set Z of intervals such that every point of (a, b) is an inner point of at least one interval belonging to Z. Then there is a subset Z′ of Z containing only a finite number of intervals, such that every point of (a, b) is an inner point of at least one interval belonging to Z′.

Divide the interval (a, b) into n parts of equal length. The lemma will be proved, if we can show that it is possible so to choose n that each of the n parts — considered as a closed interval — is entirely contained in an interval belonging to Z.

Suppose, in fact, that this is not possible, and denote by iₙ the first of the n parts, starting from the end-point a, which is not entirely contained in an interval belonging to Z. The length of iₙ obviously tends to zero as n tends to infinity. Let the middle point of iₙ be denoted by xₙ, and consider the sequence x₁, x₂, . . .. Since this is a bounded infinite sequence, it has by the Bolzano-Weierstrass theorem (cf 2.2) certainly a limiting point x. Every neighbourhood of the point x then contains an interval iₙ which is not entirely contained in any interval belonging to Z. On the other hand, x is a point of (a, b) and is thus, by hypothesis, itself an inner point of some interval belonging to Z. This evidently implies a contradiction, and so the lemma is proved.

It is evident that both the lemma and the above proof may be directly generalized to any number of dimensions.

Let us now consider a sequence of intervals iᵥ = (aᵥ, bᵥ) such that the sum of all iᵥ is a finite interval i = (a, b), while no two of the iᵥ have a common point:

i = Σ₁^∞ iᵥ  (iμ iᵥ = 0 for μ ≠ ν).

We want to prove that the corresponding relation holds for the lengths:

(4.1.1)  L(i) = Σ₁^∞ L(iᵥ).

In the first place, the n intervals i₁, . . .
. . ., iₙ are a finite number of intervals contained in i, so that we have L(i₁) + ⋯ + L(iₙ) ≤ L(i), and hence, allowing n to tend to infinity,

Σ₁^∞ L(iᵥ) ≤ L(i).

It remains to prove the opposite inequality. This is the non-trivial part of the proof.

Consider the set Z which consists of the following intervals: 1) the intervals iᵥ; 2) the open intervals (a − ε, a + ε) and (b − ε, b + ε); 3) the open intervals (aᵥ − ε/2ᵛ, aᵥ + ε/2ᵛ) and (bᵥ − ε/2ᵛ, bᵥ + ε/2ᵛ), where ν = 1, 2, . . ., while ε is positive and arbitrarily small. It is then evident that every point of the closed interval (a, b) is an inner point of at least one interval belonging to Z. According to Borel's lemma we may thus entirely cover i by means of a finite number of intervals belonging to Z, and the sum of the lengths of these intervals will then certainly be greater than L(i) = b − a. The sum of the lengths of all intervals belonging to Z will a fortiori be greater than L(i), so that we have

Σ₁^∞ L(iᵥ) + 4ε + 4 Σ₁^∞ ε/2ᵛ = Σ₁^∞ L(iᵥ) + 8ε > L(i).

Since ε is arbitrary, it follows that

Σ₁^∞ L(iᵥ) ≥ L(i),

and (4.1.1) is proved.

It is further easily proved that (4.1.1) holds also in the case when i is an infinite interval. In this case, we have L(i) = +∞, and if i₀ is any finite interval contained in i, it follows from the latter part of the above proof that we have

Σ₁^∞ L(iᵥ) ≥ L(i₀).

Since i is infinite we may, however, choose i₀ such that L(i₀) is greater than any given quantity, and thus (4.1.1) holds in the sense that both members are infinite.

We have thus proved that, if an interval is divided into a finite or enumerable number of intervals without common points, the length of the total interval is equal to the sum of the lengths of the parts. This property will be expressed by saying that the length L(i) is an additive function of the interval i.

4.2. Generalization. — The length of an interval is a measure of the extension of the interval. We have seen in the preceding paragraph that this measure has the fundamental properties of being non-negative and additive.
The length of an interval i is thus a non-negative and additive interval function L(i). The value of this function may be finite or infinite.

We now ask if it is possible to define a measure with the same fundamental properties also for more complicated sets than intervals. With any set S belonging to some more or less general class, we thus want to associate a finite or infinite*) number L(S), the measure of S, in such a way that the following three conditions are satisfied:

a) L(S) ≥ 0.

b) If S = S₁ + S₂ + ⋯, where Sμ Sᵥ = 0 for μ ≠ ν, then we have L(S) = L(S₁) + L(S₂) + ⋯.

c) In the particular case when S is an interval, L(S) is equal to the length of the interval.

Thus we want to extend the definition of the interval function L(i), so that we obtain a non-negative and additive set function L(S) which, in the particular case when S is an interval i, coincides with L(i). It might well be asked why this extension should be restricted to »some more or less general class of sets», and why we should not at once try to define L(S) for every set S. It can, however, be shown that this is not possible. We shall accordingly content ourselves to show that a set function L(S) with the required properties can be defined for a class of sets that includes the whole class 𝔅₁ of Borel sets. This set function L(S) is known as the Lebesgue measure of the set S. We shall further show that the extension is unique or, more precisely, that L(S) is the only set function which is defined for all Borel sets and satisfies the conditions a)–c).

4.3. The measure of a sum of intervals. — We shall first define a measure L(I) for the sets I belonging to the class ℑ considered in 2.3. Every set in ℑ is the sum of a finite or enumerable sequence of intervals and, by the transformation used in 1.4, we can always take these intervals such that no two of them have a common point.
(In fact, if the sets Sᵥ considered in 1.4 are intervals, every Zᵥ will be the sum of a finite number of intervals without common points.) Any set in ℑ may thus be represented in the form

(4.3.1)  I = i₁ + i₂ + ⋯,

*) For the set function L(S), and the more general set functions considered in Ch. 6, we shall admit the existence of infinite values. For sets of points and for ordinary functions, on the other hand, we shall only deal with infinity in the sense of a limit, but not as an independent point or value (cf 2.1 and 8.1).

where the iᵥ are intervals such that iμ iᵥ = 0 for μ ≠ ν. By the conditions b) and c) of 4.2, we must then define the measure L(I) by writing

(4.3.2)  L(I) = L(i₁) + L(i₂) + ⋯,

where as before L(iᵥ) denotes the length of the interval iᵥ.

The representation of I in the form (4.3.1) is, however, obviously not unique. Let

(4.3.3)  I = j₁ + j₂ + ⋯

be another representation of the same set I, the jᵥ being intervals such that jμ jᵥ = 0 for μ ≠ ν. We must then show that (4.3.1) and (4.3.3) yield the same value of L(I), i.e. that

(4.3.4)  Σᵥ L(iᵥ) = Σᵥ L(jᵥ).

This may be proved in the following way. For any interval iμ we have, since iμ ⊂ I = Σᵥ jᵥ,

iμ = iμ I = Σᵥ iμ jᵥ,

and thus, by the additive property of the length of an interval, L(iμ) = Σᵥ L(iμ jᵥ), and

(4.3.5)  Σμ L(iμ) = Σμ Σᵥ L(iμ jᵥ).

In the same way we obtain

(4.3.6)  Σᵥ L(jᵥ) = Σᵥ Σμ L(iμ jᵥ).

Now the following three cases may occur: 1) The intervals iμ jᵥ are all finite, and the double series Σ L(iμ jᵥ) with non-negative terms is convergent. 2) All the iμ jᵥ are finite, and the double series is divergent. 3) At least one of the iμ jᵥ is an infinite interval.

In case 1), the expressions in the second members of (4.3.5) and (4.3.6) are finite and equal, and thus (4.3.4) holds. In cases 2) and 3) the same expressions are both infinite. Thus in any case (4.3.4) is proved, and it follows that the definition (4.3.2) yields a uniquely determined — finite or infinite — value of L(I).

It is obvious that the measure L(I) thus defined satisfies the conditions a) and c) of 4.2.
It remains to show that condition b) is also satisfied. Let I₁, I₂, . . . be a sequence of sets in ℑ, such that Iμ Iᵥ = 0 for μ ≠ ν, and let

Iμ = Σᵥ iμᵥ

be a representation of Iμ in the form used above. Then

I = Σμ Iμ = Σμ Σᵥ iμᵥ

is also a set in ℑ, and no two of the iμᵥ have a common point. If i′, i″, . . . is an arrangement of the double sequence {iμᵥ} in a simple sequence (e.g. by diagonals as in 1.4), we have

I = i′ + i″ + ⋯,
L(I) = L(i′) + L(i″) + ⋯.

A discussion of possible cases similar to the one given above then shows that we always have

L(I) = Σμ Σᵥ L(iμᵥ) = Σμ L(Iμ).

We have thus proved that (4.3.2) defines for all sets I belonging to the class ℑ a unique measure L(I) satisfying the conditions a)–c) of 4.2.

We shall now deduce some properties of the measure L(I). In the first place, we consider a sequence I₁, I₂, . . . of sets in ℑ, without assuming that Iμ and Iᵥ have no common points. For the sum I = I₁ + I₂ + ⋯, we obtain as above the representation I = i′ + i″ + ⋯, but the intervals i′, i″, . . . may now have common points. By the transformation used in 1.4 it is then easily seen that we always have

L(I) ≤ L(i′) + L(i″) + ⋯,

which gives

(4.3.7)  L(I₁ + I₂ + ⋯) ≤ L(I₁) + L(I₂) + ⋯.

(In the particular case when Iμ Iᵥ = 0 for μ ≠ ν, we have already seen that the sign of equality holds in this relation.)

We further observe that any enumerable set of points x₁, x₂, . . . is a set in ℑ, since each xₙ may be regarded as a degenerate interval, the length of which reduces to zero. It then follows from the definition (4.3.2) that the measure of an enumerable set is always equal to zero. — Hence we obtain a simple proof of a property mentioned above (1.1 and 2.1) without proof: the set of all points belonging to a non-degenerate interval is a non-enumerable set. In fact, the measure of this set is equal to the length of the interval, which is a positive quantity, while any enumerable set is of measure zero. A fortiori, the same property holds for a non-degenerate interval in Rₙ with n > 1 (cf 3.1).
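The text obtains the measure zero of an enumerable set directly from degenerate intervals of length zero. A closely related standard device — not the one used above, but the quantitative form in which the result is often verified — covers the n-th point by an interval of length ε/2ⁿ, so that the total length of the cover is at most ε for every ε > 0. A sketch (illustrative; the function name is ours):

```python
# Illustrative covering device: the n-th point of an enumerable set is
# enclosed in an interval of length eps / 2**n; the lengths of the cover
# then sum to at most eps, which can be made as small as we please.

def cover_lengths(eps, n_points):
    """Lengths eps/2, eps/4, ..., eps/2**n_points of a covering of the
    first n_points points of an enumerable set."""
    return [eps / 2 ** n for n in range(1, n_points + 1)]

total = sum(cover_lengths(1e-6, 50))  # strictly less than eps = 1e-6
```

Since the sum of the covering lengths never exceeds ε, the outer content of the covered set is below every positive bound, which is exactly the statement that an enumerable set has measure zero.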
Finally, we shall prove the following theorem that will be required in the sequel for the extension of the definition of measure to more general classes of sets:

If I and J are sets in ℑ that are both of finite measure, we have

(4.3.8)  L(I + J) = L(I) + L(J) − L(IJ).

Consider first the case when I and J both are sums of a finite number of intervals. From the relations

I + J = I + (J − IJ),   J = IJ + (J − IJ)

we obtain, since all sets belong to ℑ, and the two terms in each second member have no common point,

L(I + J) = L(I) + L(J − IJ),
L(J) = L(IJ) + L(J − IJ),

and then by subtraction we obtain (4.3.8).

In the general case, when I and J are sums of finite or enumerable sequences of intervals, we cannot argue in this simple way, as we are not sure that J − IJ is a set in ℑ (cf 2.3) and, if this is not the case, the measure L(J − IJ) has not yet been defined. Let

I = Σ₁^∞ iᵥ,   J = Σ₁^∞ jᵥ

be representations of I and J of the form (4.3.1), and put

Iₙ = Σ₁^n iᵥ,   Jₙ = Σ₁^n jᵥ.

According to the above, we then have

L(Iₙ + Jₙ) = L(Iₙ) + L(Jₙ) − L(Iₙ Jₙ).

Allowing now n to tend to infinity, each term of the last relation tends to the corresponding term of (4.3.8), and thus this relation is proved.

4.4. Outer and inner measure of a bounded set. — In the preceding paragraph, we have defined a measure L(I) for all sets I belonging to the class ℑ. In order to extend the definition to a more general class of sets, we shall now introduce two auxiliary functions, the inner and outer measure, that will be defined for every bounded set in R₁. Throughout this paragraph, we shall only consider bounded sets.

We choose a fixed finite interval (a, b) as our space and consider only points and sets belonging to (a, b). When speaking about the complement S* of a set S, we shall accordingly always mean the complement with respect to (a, b). (Cf 1.3.)

In order to define the new functions, we consider a set I belonging to the class ℑ, such that S ⊂ I ⊂ (a, b). Thus we enclose the set S in a sum I of intervals, which in its turn is a subset of (a, b).
This can always be done, since we may e.g. choose I = (a, b). The enclosing set I has a measure L(I) defined in the preceding paragraph. Consider the set formed by the numbers L(I) corresponding to all possible enclosing sets I. Obviously this set has a finite lower bound, since we have L(I) ≥ 0. The outer measure L̄(S) of the set S will be defined as the lower bound of the set of all these numbers L(I). The inner measure L̲(S) of S will be defined by the relation

L̲(S) = b − a − L̄(S*).

Since every set S considered here is a subset of the interval (a, b), which is itself a set in ℑ, we obviously have

0 ≤ L̄(S) ≤ b − a,   0 ≤ L̲(S) ≤ b − a.

Directly from the definitions we further find that L̄(S) and L̲(S) are both monotone functions of S, i.e. that we have

(4.4.1)  L̄(S₁) ≤ L̄(S₂),   L̲(S₁) ≤ L̲(S₂),

as soon as S₁ ⊂ S₂. In fact, for any I such that S₂ ⊂ I, we then also have S₁ ⊂ I, and hence the first inequality follows immediately. The second inequality is obtained from the first by considering the complementary sets.

Further, if S ⊂ I₁ and S* ⊂ I₂, every point of (a, b) belongs to at least one of the sets I₁ and I₂. Since I₁ and I₂ are both contained in (a, b), we then have I₁ + I₂ = (a, b) and thus by (4.3.7)

L(I₁) + L(I₂) ≥ b − a.

Choosing the enclosing sets I₁ and I₂ in all possible ways, we find that the corresponding inequality must hold for the lower bounds of L(I₁) and L(I₂), so that we may write L̄(S) + L̄(S*) ≥ b − a, or

(4.4.2)  L̲(S) ≤ L̄(S).

Let S₁, S₂, . . . be a given sequence of sets with or without common points. According to the definition of outer measure, we can for every n find Iₙ such that Sₙ ⊂ Iₙ and

L(Iₙ) < L̄(Sₙ) + ε/2ⁿ,

where ε is arbitrarily small. We then have S₁ + S₂ + ⋯ ⊂ I₁ + I₂ + ⋯, and from (4.3.7) we obtain

L̄(S₁ + S₂ + ⋯) ≤ L(I₁ + I₂ + ⋯) ≤ L(I₁) + L(I₂) + ⋯ < L̄(S₁) + L̄(S₂) + ⋯ + ε(1/2 + 1/4 + ⋯).

Since ε is arbitrary, it follows that

(4.4.3)  L̄(S₁ + S₂ + ⋯) ≤ L̄(S₁) + L̄(S₂) + ⋯.
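The definition of the outer measure as a lower bound over enclosing sums of intervals suggests a crude numerical approximation: cover (a, b) by a fine grid of equal subintervals and sum the lengths of those that meet S. The sketch below is illustrative only and makes a simplifying assumption of our own — the set is probed at the midpoint of each grid interval, which detects membership correctly for the simple set chosen but is not a general test.

```python
def outer_measure_estimate(indicator, a, b, depth):
    """Estimate of the outer measure of a set S inside (a, b): sum the
    lengths of the 2**depth equal subintervals of (a, b) whose midpoint
    belongs to S (membership given by the indicator function).

    Midpoint probing is a simplifying assumption; it is only an
    illustration of approximating the lower bound over enclosing sets."""
    n = 2 ** depth
    h = (b - a) / n
    return sum(h for k in range(n) if indicator(a + (k + 0.5) * h))

# For S = the open interval (0, 1/2) inside (0, 1), the estimate
# approaches L(S) = 1/2 as the grid is refined.
est = outer_measure_estimate(lambda x: 0 < x < 0.5, 0.0, 1.0, 10)
```

For genuinely irregular sets — such as the rationals of (0, 1), whose outer measure is zero although every grid interval contains rational points — no finite grid of this kind gives the right answer, which is precisely why the definition takes a lower bound over all enclosing sums of intervals rather than over grids of equal parts.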
In order to deduce a corresponding inequality for the inner measure L̲(S), we consider two sets S₁ and S₂ without common points. Let the complementary sets S₁* and S₂* be enclosed in I₁ and I₂ respectively. Abbreviating the words »lower bound of» by »l.b.», we then have

(4.4.4)  b − a − L̲(S₁) = L̄(S₁*) = l.b. L(I₁),
         b − a − L̲(S₂) = L̄(S₂*) = l.b. L(I₂),

where the enclosing sets I₁ and I₂ have to be chosen in all possible ways. Further, we have by (1.3.1)

(S₁ + S₂)* = S₁* S₂* ⊂ I₁ I₂,

but here we can only infer that

(4.4.5)  b − a − L̲(S₁ + S₂) = L̄((S₁ + S₂)*) ≤ l.b. L(I₁ I₂),

since there may be other enclosing I-sets for (S₁ + S₂)* besides those of the form I₁ I₂. From (4.4.4) and (4.4.5) we deduce, using (4.3.8),

L̲(S₁ + S₂) − L̲(S₁) − L̲(S₂) ≥ l.b. [L(I₁) + L(I₂)] − l.b. L(I₁ I₂) − (b − a)
  ≥ l.b. [L(I₁) + L(I₂) − L(I₁ I₂)] − (b − a)
  = l.b. L(I₁ + I₂) − (b − a).

Since S₁ and S₂ have no common point we have, however, S₁ S₂ = 0 and I₁ + I₂ ⊃ S₁* + S₂* = (S₁ S₂)* = (a, b). On the other hand, I₁ and I₂ are both contained in (a, b), so that I₁ + I₂ ⊂ (a, b). Thus I₁ + I₂ = (a, b), and

L̲(S₁ + S₂) ≥ L̲(S₁) + L̲(S₂).

Let now S₁, S₂, . . . be a sequence of sets, no two of which have a common point. By a repeated use of the last inequality, we then obtain

(4.4.6)  L̲(S₁ + S₂ + ⋯) ≥ L̲(S₁) + L̲(S₂) + ⋯.

In the particular case when S is an interval, it is easily seen from the definitions that L̄(S) and L̲(S) are both equal to the length of the interval. If I = Σᵥ iᵥ is a set in ℑ, where the iᵥ are intervals without common points, we then obtain from (4.4.3) and (4.4.6)

L̄(I) ≤ Σᵥ L(iᵥ),   L̲(I) ≥ Σᵥ L(iᵥ),

and thus by (4.4.2) and (4.3.2)

(4.4.7)  L̲(I) = L(I) = L̄(I).

Finally, we observe that the outer and inner measures are independent of the interval (a, b) in which we have assumed all our sets to be contained. By 2.2, a bounded set S is always contained in the closed interval (α, β), where α and β are the lower and upper bounds of S. If (a, b) is any other interval containing S, we must have a ≤ α and b ≥ β.
A simple consideration will then show that the two intervals (a, b) and (α, β) will yield the same values of the outer and inner measures of S. Thus the quantities L̄(S) and L̲(S) depend only on the set S itself, and not on the interval (a, b).

4.5. Measurable sets and Lebesgue measure. — A bounded set S will be called measurable, if its outer and inner measures are equal. Their common value will then be denoted by L(S) and called the Lebesgue measure or simply the measure of S:

L(S) = L̄(S) = L̲(S).

An unbounded set S will be called measurable if the product iₓ S, where iₓ denotes the closed interval (−x, x), is measurable for every x > 0. The measure L(S) will then be defined by the relation

L(S) = lim_{x→∞} L(iₓ S).

By (4.4.1), L(iₓ S) is a never decreasing function of x. Thus the limit, which may be finite or infinite, always exists.

In the particular case when S is a set in ℑ, the new definition of measure is consistent with the previous definition (4.3.2). For a bounded set I, this follows immediately from (4.4.7). For an unbounded set I, we obtain the same result by considering the bounded set iₓ I and allowing x to tend to infinity.

According to (4.4.1), L̄(S) and L̲(S) are both monotone functions of the set S. It then follows from the above definition that the same holds for L(S). For any two measurable sets S₁ and S₂ such that S₁ ⊂ S₂ we thus have

(4.5.1)  L(S₁) ≤ L(S₂).

We shall now show that the measure L(S) satisfies the conditions a)–c) of 4.2. — With respect to the conditions a) and c), this follows directly from the above, so that it only remains to prove that the condition b) is satisfied. This is the content of the following theorem.

If S₁, S₂, . . . are measurable sets, no two of which have a common point, then the sum S₁ + S₂ + ⋯ is also measurable, and we have

(4.5.2)  L(S₁ + S₂ + ⋯) = L(S₁) + L(S₂) + ⋯.

Consider first the case when S₁, S₂, . . . are all contained in a finite interval (a, b).
The relations (4.4.3) and (4.4.6) then give, since all the $S_n$ are measurable,

$$\overline{L}(S_1 + S_2 + \cdots) \le \overline{L}(S_1) + \overline{L}(S_2) + \cdots = L(S_1) + L(S_2) + \cdots,$$
$$\underline{L}(S_1 + S_2 + \cdots) \ge \underline{L}(S_1) + \underline{L}(S_2) + \cdots = L(S_1) + L(S_2) + \cdots.$$

By (4.4.2) we have, however, $\underline{L}(S_1 + S_2 + \cdots) \le \overline{L}(S_1 + S_2 + \cdots)$, and thus

$$\underline{L}(S_1 + S_2 + \cdots) = \overline{L}(S_1 + S_2 + \cdots) = L(S_1) + L(S_2) + \cdots,$$

so that in this case our assertion is true. In the general case, we consider the products $i_x S_1, i_x S_2, \ldots$, all of which are contained in the finite interval $i_x$. The above argument then shows that the product $i_x(S_1 + S_2 + \cdots)$ is measurable for any $x$, and that

$$L[i_x(S_1 + S_2 + \cdots)] = L(i_x S_1) + L(i_x S_2) + \cdots.$$

Then, by definition, $S_1 + S_2 + \cdots$ is measurable and we have, since every term of the last series is a never decreasing function of $x$,

$$L(S_1 + S_2 + \cdots) = \lim_{x \to \infty} [L(i_x S_1) + L(i_x S_2) + \cdots] = L(S_1) + L(S_2) + \cdots.$$

Thus (4.5.2) is proved, and the Lebesgue measure $L(S)$ satisfies all three conditions of 4.2. A set $S$ such that $L(S) = 0$ is called a set of measure zero. If the outer measure $\overline{L}(S) = 0$, it follows from the definition of measure that $S$ is of measure zero. We have seen in 4.3 that, in particular, any enumerable set has this property. — The following two propositions are easily found from the above. Any subset of a set of measure zero is itself of measure zero. The sum of a sequence of sets of measure zero is itself of measure zero. — These propositions are in fact direct consequences of the relations (4.4.1) and (4.4.3) for the outer measure.

4.6. The class of measurable sets. — Let us consider the class $\mathfrak{L}$ of all measurable sets in $R_1$. We are going to show that $\mathfrak{L}$ is an additive class of sets (cf 1.6). Since we have seen in the preceding paragraph that $\mathfrak{L}$ contains all intervals, it then follows from 2.3 that $\mathfrak{L}$ contains the whole class $\mathfrak{B}_1$ of all Borel sets, so that all Borel sets are measurable. We shall, in fact, prove that the class $\mathfrak{L}$ satisfies the conditions a₁), b₁) and c₁) of 1.6.
With respect to a₁), this is obvious, so that we need only consider b₁) and c₁). Let us first take c₁). It is required to show that the complement $S^*$ of a measurable set $S$ is itself measurable. Consider first the case of a bounded set $S$ and its complement $S^*$ with respect to some finite interval $(a, b)$ containing $S$. By the definition of inner measure (4.4) we then have, since $S$ is measurable,

$$\underline{L}(S^*) = b - a - \overline{L}(S) = b - a - \underline{L}(S) = \overline{L}(S^*),$$

so that $S^*$ is measurable, and has the measure $b - a - L(S)$. — In the general case when $S$ is measurable but not necessarily bounded, the same argument shows that the product $i_x S^*$, where $S^*$ is now the complement with respect to the whole space $R_1$, is measurable for any $x > 0$. Then, by definition, $S^*$ is measurable. Consider now the condition b₁). We have to show that the sum $S_1 + S_2 + \cdots$ of any measurable sets $S_1, S_2, \ldots$ is itself measurable. — In the particular case when $S_\mu S_\nu = 0$ for $\mu \ne \nu$, this has already been proved in connection with (4.5.2), but it still remains to prove the general case. It is sufficient to consider the case when all $S_n$ are contained in a finite interval $(a, b)$. In fact, if our assertion has been proved for this case, we consider the sets $i_x S_1, i_x S_2, \ldots$ and find that their sum $i_x(S_1 + S_2 + \cdots)$ is measurable for any $x > 0$. Then, by definition, $S_1 + S_2 + \cdots$ is measurable. We thus have to prove that, if the measurable sets $S_1, S_2, \ldots$ are all contained in $(a, b)$, the sum $S_1 + S_2 + \cdots$ is measurable. We shall first prove this for the particular case of only two sets $S_1$ and $S_2$. Let $n$ denote any of the indices 1 and 2, and let the complementary sets be taken with respect to $(a, b)$. Since $S_n$ and $S_n^*$ are both measurable, we can find two sets $I_n$ and $J_n$ in $\mathfrak{I}$ such that

(4.6.1) $S_n \subset I_n \subset (a, b), \qquad S_n^* \subset J_n \subset (a, b),$

while the differences $L(I_n) - L(S_n)$ and $L(J_n) - L(S_n^*)$ are both smaller than any given $\varepsilon > 0$.
Now by (4.6.1) any point of $(a, b)$ must belong to at least one of the sets $I_n$ and $J_n$, so that we have $I_n + J_n = (a, b)$, and thus by (4.3.8)

(4.6.2) $L(I_n J_n) = L(I_n) + L(J_n) - (b - a) = L(I_n) + L(J_n) - L(S_n) - L(S_n^*) < 2\varepsilon.$

It further follows from (4.6.1) that $S_1 + S_2 \subset I_1 + I_2$ and $(S_1 + S_2)^* = S_1^* S_2^* \subset J_1 J_2$, and hence

(4.6.3) $\overline{L}(S_1 + S_2) \le L(I_1 + I_2), \qquad \underline{L}(S_1 + S_2) \ge b - a - L(J_1 J_2).$

By the same argument as before, we find that $I_1 + I_2 + J_1 J_2 = (a, b)$. The relations (4.6.3) then give, using once more (4.3.8),

$$\overline{L}(S_1 + S_2) - \underline{L}(S_1 + S_2) \le L[(I_1 + I_2) J_1 J_2].$$

Now $(I_1 + I_2) J_1 J_2 = I_1 J_1 J_2 + I_2 J_1 J_2 \subset I_1 J_1 + I_2 J_2$, so that we obtain by means of (4.5.1), (4.3.7) and (4.6.2)

$$\overline{L}(S_1 + S_2) - \underline{L}(S_1 + S_2) \le L(I_1 J_1) + L(I_2 J_2) < 4\varepsilon.$$

Since $\varepsilon$ is arbitrary, and since the outer measure is always at least equal to the inner measure, it then follows that $\overline{L}(S_1 + S_2) = \underline{L}(S_1 + S_2)$, so that $S_1 + S_2$ is measurable. It immediately follows that any sum $S_1 + \cdots + S_n$ of a finite number of measurable sets, all contained in $(a, b)$, is measurable. The relation $S_1 S_2 \cdots S_n = (S_1^* + \cdots + S_n^*)^*$ then shows that the same property holds for a product. Consider finally the case of an infinite sum. By the transformation used in 1.4, we have $S = S_1 + S_2 + \cdots = Z_1 + Z_2 + \cdots$, where $Z_\nu = S_1^* \cdots S_{\nu-1}^* S_\nu$, and $Z_\mu Z_\nu = 0$ for $\mu \ne \nu$. Since $S_1^*, \ldots, S_{\nu-1}^*$ and $S_\nu$ are all measurable, the finite product $Z_\nu$ is measurable. Finally, by (4.5.2), the sum $Z_1 + Z_2 + \cdots$ is measurable. We have thus completed the proof that the measurable sets form an additive class $\mathfrak{L}$. It follows that any sum, product or difference of a finite or enumerable number of measurable sets is itself measurable. In particular, all Borel sets are measurable.

4.7. Measurable sets and Borel sets. — The class of measurable sets is, in fact, more general than the class of Borel sets.
As an illustration of the difference in generality between the two classes, we mention without proof the following proposition: Any measurable set is the sum of a Borel set and a set of measure zero. All sets occurring in ordinary applications of mathematical analysis are, however, Borel sets, and we shall accordingly in general restrict ourselves to the consideration of the class $\mathfrak{B}_1$ and the corresponding class $\mathfrak{B}_n$ in spaces of $n$ dimensions. We shall now prove the statement made in 4.2 that the Lebesgue measure is the only set function defined for all Borel sets and satisfying the conditions a)–c) of 4.2. Let, in fact, $\Lambda(S)$ be any set function satisfying all the conditions just stated. For any set $I$ in $\mathfrak{I}$ we must obviously have $\Lambda(I) = L(I)$, since our definition (4.3.2) of $L(I)$ was directly imposed by the conditions b) and c) of 4.2. Let now $S$ be a bounded Borel set, and enclose $S$ in a sum $I$ of intervals. From the conditions a) and b) it then follows that we have $\Lambda(S) \le \Lambda(I) = L(I)$. The lower bound of $L(I)$ for all enclosing $I$ is equal to $\overline{L}(S) = L(S)$, and so we have $\Lambda(S) \le L(S)$. Replacing $S$ by its complement $S^*$ with respect to some finite interval, we have $\Lambda(S^*) \le L(S^*)$, and hence $\Lambda(S) \ge L(S)$. Thus $\Lambda(S)$ and $L(S)$ are identical for all bounded Borel sets. This identity holds even for unbounded sets, since any unbounded Borel set may obviously be represented as the sum of a sequence of bounded Borel sets. We shall finally prove a theorem concerning the measure of the limit (cf 1.5) of a monotone sequence of Borel sets. By 2.3, we know that any such limit is always a Borel set. For a non-decreasing sequence $S_1, S_2, \ldots$ of Borel sets we have

(4.7.1) $\lim L(S_n) = L(\lim S_n).$

For a non-increasing sequence, the same relation holds provided that $L(S_1)$ is finite.
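The relation (4.7.1) lends itself to a small numerical sketch (a modern aside, not part of the original argument). Assuming the illustrative choice $S_n = (0,\, 1 - 1/n)$, a non-decreasing sequence of intervals with $\lim S_n = (0, 1)$, the measures $L(S_n) = 1 - 1/n$ increase toward $L(\lim S_n) = 1$:

```python
# Numerical sketch of (4.7.1): for the non-decreasing intervals
# S_n = (0, 1 - 1/n) we have lim S_n = (0, 1), and L(S_n) -> L(lim S_n) = 1.
# The particular sets used here are an illustrative assumption only.

def length(a, b):
    """Lebesgue measure of the interval (a, b)."""
    return max(0.0, b - a)

measures = [length(0.0, 1.0 - 1.0 / n) for n in range(1, 1001)]

# L(S_n) is a never decreasing sequence tending to L((0, 1)) = 1.
assert all(m1 <= m2 for m1, m2 in zip(measures, measures[1:]))
assert abs(measures[-1] - 1.0) < 1e-2
```

The condition that $L(S_1)$ be finite in the non-increasing case is essential: for $S_n = (n, +\infty)$ every $L(S_n)$ is infinite while the limiting set is empty.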
For a non-decreasing sequence we may in fact write $\lim S_n = S_1 + (S_2 - S_1) + (S_3 - S_2) + \cdots$, and then obtain by (4.5.2)

$$L(\lim S_n) = L(S_1) + L(S_2 - S_1) + \cdots = \lim_{n \to \infty} [L(S_1) + L(S_2 - S_1) + \cdots + L(S_n - S_{n-1})] = \lim_{n \to \infty} L(S_n).$$

For a non-increasing sequence such that $L(S_1)$ is finite, the same relation is proved by considering the complementary sets $S_n^*$ with respect to $S_1$. — The example $S_n = (n, +\infty)$ shows that the condition that $L(S_1)$ should be finite cannot be omitted.

CHAPTER 5.

The Lebesgue Integral for Functions of One Variable.

5.1. The integral of a bounded function over a set of finite measure. — All point sets considered in the rest of this book are Borel sets, unless expressly stated otherwise.¹) Generally this will not be explicitly mentioned, and should then always be tacitly understood.

¹) In order to give a full account of the theory of the Lebesgue integral, it would be necessary to consider measurable sets, and not only Borel sets. As stated in 4.7 the restriction to Borel sets is, however, amply sufficient for our purposes.

Let $S$ be a given set of finite measure $L(S)$, and $g(x)$ a function of the real variable $x$ defined for all values of $x$ belonging to $S$. We shall suppose that $g(x)$ is bounded in $S$, i.e. that the lower and upper bounds of $g(x)$ in $S$ are finite. We denote these bounds by $m$ and $M$ respectively, and thus have $m \le g(x) \le M$ for all $x$ belonging to $S$. Let us divide $S$ into a finite number of parts $S_1, S_2, \ldots, S_n$, no two of which have a common point, so that we have $S = S_1 + S_2 + \cdots + S_n$, ($S_\mu S_\nu = 0$ for $\mu \ne \nu$). In the set $S_\nu$, the function $g(x)$ has a lower bound $m_\nu$ and an upper bound $M_\nu$, such that $m \le m_\nu \le M_\nu \le M$. We now define the lower and upper Darboux sums associated with this division of $S$ by the relations

(5.1.1) $z = \sum_1^n m_\nu L(S_\nu), \qquad Z = \sum_1^n M_\nu L(S_\nu).$

It is then obvious that we have $m L(S) \le z \le Z \le M L(S)$. It is also directly seen that any division of $S$ superposed on the above division, i.e.
any division obtained by subdivision of some of the parts $S_\nu$, will give a lower sum at least equal to the lower sum of the original division, and an upper sum at most equal to the upper sum of the original division. Any division of $S$ in an arbitrary finite number of parts without common points yields, according to (5.1.1), a lower sum $z$ and an upper sum $Z$. Consider the set of all possible lower sums $z$, and the set of all possible upper sums $Z$. We shall call these briefly the $z$-set and the $Z$-set. Both sets are bounded, since all $z$ and $Z$ are situated between the points $m L(S)$ and $M L(S)$. We shall now show that the upper bound of the $z$-set is at most equal to the lower bound of the $Z$-set. Thus the two sets have at most one common point, and apart from this point, the entire $z$-set is situated to the left of the entire $Z$-set. In order to prove this statement, let $z'$ be an arbitrary lower sum, corresponding to the division $S = S_1' + \cdots + S_{n'}'$, while $Z''$ is an arbitrary upper sum, corresponding to the division $S = S_1'' + \cdots + S_{n''}''$. It is then clearly sufficient to prove that we have $z' \le Z''$. This follows, however, immediately if we consider the division

$$S = \sum_{i=1}^{n'} \sum_{k=1}^{n''} S_i' S_k'',$$

which is superposed on both the previous divisions. If the corresponding Darboux sums are $z_0$ and $Z_0$, we have by the above remark $z' \le z_0 \le Z_0 \le Z''$, and thus our assertion is proved. The upper bound of the $z$-set will be called the lower integral of $g(x)$ over $S$, while the lower bound of the $Z$-set will be called the upper integral of $g(x)$ over $S$. We write

(5.1.2) $\underline{\int_S} g(x)\,dx = \text{upper bound of the } z\text{-set}, \qquad \overline{\int_S} g(x)\,dx = \text{lower bound of the } Z\text{-set}.$

It then follows from the above that we have

(5.1.3) $m L(S) \le \underline{\int_S} g(x)\,dx \le \overline{\int_S} g(x)\,dx \le M L(S).$

If the lower and upper integrals are equal (i.e. if the upper bound of the $z$-set is equal to the lower bound of the $Z$-set), $g(x)$ is said to be integrable in the Lebesgue sense over $S$, or briefly integrable over $S$.
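The sums (5.1.1) can be sketched computationally (a modern aside, not part of the original exposition). Assuming the illustrative choice $g(x) = x^2$ on $S = (0, 1)$ with a division into $n$ equal sub-intervals, a special case of the divisions into Borel sets considered above:

```python
# Lower and upper Darboux sums (5.1.1) for g(x) = x^2 on S = (0, 1),
# divided into n equal sub-intervals. The function and the division
# are illustrative assumptions only.

def darboux_sums(n):
    z = Z = 0.0
    for k in range(n):
        a, b = k / n, (k + 1) / n     # the part S_k, of measure 1/n
        m_k, M_k = a * a, b * b       # lower and upper bounds of x^2 on S_k
        z += m_k / n                  # contribution m_k * L(S_k)
        Z += M_k / n                  # contribution M_k * L(S_k)
    return z, Z

z, Z = darboux_sums(1000)
# m L(S) <= z <= Z <= M L(S), with m = 0, M = 1 and L(S) = 1:
assert 0.0 <= z <= Z <= 1.0
# Refining the division drives Z - z to 0; both sums enclose 1/3.
assert Z - z < 1e-2
assert z <= 1.0 / 3.0 <= Z
```

Refining the division only raises $z$ and lowers $Z$, in accordance with the remark on superposed divisions above.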
The common value of the two integrals is then called the Lebesgue integral of $g(x)$ over $S$, and we write

$$\int_S g(x)\,dx = \underline{\int_S} g(x)\,dx = \overline{\int_S} g(x)\,dx.$$

A necessary and sufficient condition for the integrability of $g(x)$ over $S$ is that, to every $\varepsilon > 0$, we can find a division of $S$ such that the corresponding difference $Z - z$ is smaller than $\varepsilon$. In fact, if this condition is satisfied, it follows from our definitions of the lower and upper integrals that the difference between these is smaller than $\varepsilon$, and since $\varepsilon$ is arbitrary, the two integrals must be equal. Conversely, if it is known that $g(x)$ is integrable, it immediately follows that there must be one lower sum $z'$ and one upper sum $Z''$ such that $Z'' - z' < \varepsilon$. The division superposed on both the corresponding divisions in the manner considered above will then give a lower sum $z_0$ and an upper sum $Z_0$ such that $Z_0 - z_0 < \varepsilon$. It will be seen that all this is perfectly analogous to the ordinary text-book definition of the Riemann integral. In that case, the set $S$ is an interval which is divided into a finite number of sub-intervals $S_\nu$, and the Darboux sums $z$ and $Z$ are then formed according to (5.1.1), where now $L(S_\nu)$ denotes the length of the $\nu$:th sub-interval. The only difference is that, in the present case, we consider a more general class of sets than intervals, since $S$ and the parts $S_\nu$ may be any Borel sets. At the same time, we have replaced the length of the interval $S_\nu$ by its natural generalization, the measure of the set $S_\nu$. In the particular case when $S$ is a finite interval $(a, b)$, any division of $(a, b)$ in sub-intervals considered in the course of the definition of the Riemann integral is a special case of the divisions in Borel sets occurring in the definition of the Lebesgue integral. In the latter case, however, we consider also divisions of the interval $(a, b)$ in parts which are Borel sets other than intervals.
These more general divisions may possibly increase the value of the upper bound of the $z$-set, and reduce the value of the lower bound of the $Z$-set. Thus we see that the lower and upper integrals defined by (5.1.2) are situated between the corresponding Riemann integrals. If $g(x)$ is integrable in the Riemann sense, the latter are equal, and thus a fortiori the two integrals (5.1.2) are equal, so that $g(x)$ is also integrable in the Lebesgue sense, with the same value of the integral. When we are concerned with functions integrable in the Riemann sense, and with integrals over an interval, it is thus not necessary to distinguish between the two kinds of integrals. The definition of the Lebesgue integral is, of course, somewhat more complicated than the definition of the Riemann integral. The introduction of this complication is justified by the fact that the properties of the Lebesgue integral are simpler than those of the Riemann integral. — In order to show by an example that the Lebesgue integral exists for a more general class of functions than the Riemann integral, we consider a function $g(x)$ equal to 0 when $x$ is irrational, and to 1 when $x$ is rational. In every non-degenerate interval this function has the lower bound 0 and the upper bound 1. The lower and upper Darboux sums occurring in the definition of the Riemann integral of $g(x)$ over the interval (0, 1) are thus, for any division in sub-intervals, equal to 0 and 1 respectively, so that the Riemann integral does not exist. If, on the other hand, we divide the interval (0, 1) into the two parts $S_i$ and $S_r$, containing respectively the irrational and the rational numbers of the interval, $g(x)$ is equal to 0 everywhere in $S_i$, and to 1 everywhere in $S_r$. Further, $S_i$ has the measure 1, and $S_r$ the measure 0, so that both Darboux sums (5.1.1) corresponding to this division are equal to 0.
Then the lower and upper integrals (5.1.2) are both equal to 0, and thus the Lebesgue integral of $g(x)$ over (0, 1) exists and has the value 0. The Lebesgue integral over an interval $(a, b)$ is usually written in the same notation as a Riemann integral:

$$\int_a^b g(x)\,dx.$$

We shall see below (cf 5.3) that this integral has the same value whether we consider $(a, b)$ as closed, open, or half-open. — In the particular case when $g(x)$ is continuous for $a \le x \le b$, the integral

$$G(x) = \int_a^x g(t)\,dt$$

exists as a Riemann integral, and thus a fortiori as a Lebesgue integral, and we have

(5.1.4) $G'(x) = g(x)$

for all $x$ in $(a, b)$.

5.2. B-measurable functions. — A function $g(x)$ defined for all $x$ in a set $S$ is said to be measurable in the Borel sense, or B-measurable, in the set $S$ if the subset of all points $x$ in $S$ such that $g(x) \le k$ is a Borel set for every real value of $k$. We shall prove the following important theorem: If $g(x)$ is bounded and B-measurable in a set $S$ of finite measure, then $g(x)$ is integrable over $S$. Suppose that we have $m < g(x) \le M$ for all $x$ belonging to $S$. Let $\varepsilon > 0$ be given, and divide the interval $(m, M)$ in sub-intervals by means of points $y_\nu$ such that $m = y_0 < y_1 < \cdots < y_{n-1} < y_n = M$, the length of each sub-interval being $< \varepsilon$. Obviously this can always be done by taking $n$ sufficiently large. Now let $S_\nu$ denote the set of all points $x$ belonging to $S$ such that $y_{\nu-1} < g(x) \le y_\nu$, ($\nu = 1, 2, \ldots, n$). Then $S = S_1 + \cdots + S_n$, and $S_\mu S_\nu = 0$ for $\mu \ne \nu$. Further, $S_\nu$ is the difference between the two Borel sets defined by the inequalities $g(x) \le y_\nu$ and $g(x) \le y_{\nu-1}$ respectively, so that $S_\nu$ is a Borel set. The difference $M_\nu - m_\nu$ between the upper and lower bounds of $g(x)$ in $S_\nu$ is at most equal to $y_\nu - y_{\nu-1} < \varepsilon$. Hence we obtain for the Darboux sums corresponding to this division of $S$

$$Z - z = \sum_1^n (M_\nu - m_\nu) L(S_\nu) \le \varepsilon \sum_1^n L(S_\nu) = \varepsilon L(S).$$

But $\varepsilon$ is arbitrarily small, and thus by the preceding paragraph $g(x)$ is integrable over $S$.
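The construction of the proof, dividing the *range* of $g$ rather than the set $S$, can be sketched numerically (a modern aside). Assuming the illustrative choice $g(x) = x^2$ on $S = (0, 1)$, the set $S_\nu = \{x : y_{\nu-1} < g(x) \le y_\nu\}$ has measure $\sqrt{y_\nu} - \sqrt{y_{\nu-1}}$, so the sums $\sum y_{\nu-1} L(S_\nu)$ and $\sum y_\nu L(S_\nu)$, which enclose the Darboux sums of the proof, can be computed exactly:

```python
# Sketch of the construction in 5.2 for the illustrative choice
# g(x) = x^2 on S = (0, 1): divide the range (0, 1) by equidistant points
# y_0 < ... < y_n; the measure of S_v = {x : y_{v-1} < g(x) <= y_v}
# equals sqrt(y_v) - sqrt(y_{v-1}).
import math

def range_partition_sums(n):
    ys = [v / n for v in range(n + 1)]   # the points y_0, ..., y_n
    z = Z = 0.0
    for v in range(1, n + 1):
        measure = math.sqrt(ys[v]) - math.sqrt(ys[v - 1])   # L(S_v)
        z += ys[v - 1] * measure   # lower sum, below the Darboux sum z
        Z += ys[v] * measure       # upper sum, above the Darboux sum Z
    return z, Z

z, Z = range_partition_sums(1000)
# Z - z = (1/n) * sum L(S_v) = (1/n) L(S), as in the proof (eps = 1/n):
assert Z - z <= 1.0 / 1000 + 1e-12
# Both sums enclose the integral of x^2 over (0, 1), which is 1/3.
assert z <= 1.0 / 3.0 <= Z
```

The point of the sketch is the one made in the text: the oscillation of $g$ on each part is forced below $\varepsilon$ by the division of the range, so $Z - z \le \varepsilon L(S)$ regardless of how irregular the sets $S_\nu$ are.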
The importance of the theorem thus proved follows from the fact that all functions occurring in ordinary applications of mathematical analysis are B-measurable. — Accordingly, we shall in the sequel only consider B-measurable functions. As in the case of the Borel sets, this will generally not be explicitly mentioned, and should then always be tacitly understood. We shall here only indicate the main lines of the proof of the above statement, referring for further detail to special treatises, e.g. de la Vallée Poussin (Ref. 40). We first consider the case when the set $S$ is a finite or infinite interval $(a, b)$, and write simply »B-measurable» instead of »B-measurable in $(a, b)$». If $g_1$ and $g_2$ are B-measurable functions, the sum $g_1 + g_2$, the difference $g_1 - g_2$ and the product $g_1 g_2$ are also B-measurable. We shall give the proof for the case of the sum, the other cases being proved in a similar way. Let $k$ be given, and let $U$ denote the set of all $x$ in $(a, b)$ such that $g_1 + g_2 \le k$, while $U_r'$ and $U_r''$ denote the sets defined by the inequalities $g_1 \le r$ and $g_2 \le k - r$ respectively. Then by hypothesis $U_r'$ and $U_r''$ are Borel sets for any values of $k$ and $r$, and it will be verified without difficulty that we have

$$U = \prod_r (U_r' + U_r''),$$

where $r$ runs through the enumerable sequence of all positive and negative rational numbers. Hence by 2.3 it follows that $U$ is a Borel set for any value of $k$, and thus $g_1 + g_2$ is B-measurable. — The extension to the sum or product of a finite number of B-measurable functions is immediate. Consider now an infinite sum $g = g_1 + g_2 + \cdots$ of B-measurable functions, assumed to be convergent for any $x$ in $(a, b)$. Let $\varepsilon_1, \varepsilon_2, \ldots$ be a decreasing sequence of positive numbers tending to zero, and let $Q_{mn}$ denote the set of all $x$ in $(a, b)$ such that $g_1 + \cdots + g_m \le k + \varepsilon_n$. Then $Q_{mn}$ is a Borel set, and if we put

$$R_{mn} = Q_{mn} Q_{m+1,n} \cdots, \qquad U_n = R_{1n} + R_{2n} + \cdots, \qquad U = U_1 U_2 \cdots,$$
some reflection will show that $U$ is the set of all $x$ in $(a, b)$ such that $g(x) \le k$. Since only sums and products of Borel sets have been used, $U$ is a Borel set, and $g(x)$ is B-measurable. — Further, if $g$ is the limit of a convergent sequence $g_1, g_2, \ldots$ of B-measurable functions, we may write $g = g_1 + (g_2 - g_1) + (g_3 - g_2) + \cdots$, and thus $g$ is B-measurable. Now it is evident that the function $c x^n$ is B-measurable for any constant $c$ and any non-negative integer $n$. It follows that any polynomial is B-measurable. Any continuous function is the limit of a convergent sequence of polynomials, and is thus B-measurable. Similarly all functions obtained by limit processes from continuous functions are B-measurable. By arguments of this type, our statement is proved for the case when $S$ is an interval. If $g(x)$ is B-measurable in $(a, b)$, and $S$ is any Borel set in $(a, b)$, the function $e(x)$ equal to 1 in $S$, and to 0 in $S^*$, is evidently B-measurable in $(a, b)$. Then the product $e(x) g(x)$ is B-measurable in $(a, b)$, and this implies that $g(x)$ is B-measurable in $S$. — If, in particular, $S$ is the set of all $x$ in $(a, b)$ such that $g(x) < 0$, we have $|g(x)| = g(x) - 2 e(x) g(x)$. Thus the modulus of a B-measurable function is itself B-measurable. When we are dealing with B-measurable functions, all the ordinary analytical operations and limit processes will thus lead to B-measurable functions. By the theorem proved above, any bounded function obtained in this way will be integrable in the Lebesgue sense over any set of finite measure. For the Riemann integral, the corresponding statement is not true,¹) and this is one of the properties that renders the Lebesgue integral simpler than the Riemann integral. We shall finally add a remark that will be used later (cf 14.5). Let $g(x)$ be B-measurable in a set $S$. The equation $y = g(x)$ defines a correspondence between the variables $x$ and $y$.
Denote by $Y$ a given set on the $y$-axis, and by $X$ the set of all $x$ in $S$ such that $y = g(x)$ belongs to $Y$. We shall then say that the set $X$ corresponds to $Y$. It is obvious that, if $Y$ is the sum, product or difference of certain sets $Y_1, Y_2, \ldots$, then $X$ is the sum, product or difference of the corresponding sets $X_1, X_2, \ldots$. Further, when $Y$ is a closed infinite interval $(-\infty, k)$, we know that $X$ is a Borel set. Now any Borel set may be formed from such intervals by addition, multiplication and subtraction. It follows that the set $X$ corresponding to any Borel set $Y$ is a Borel set.

5.3. Properties of the integral. — In this paragraph we consider only bounded functions and sets of finite measure. — The following propositions (5.3.1)–(5.3.4) are perfectly analogous to the corresponding propositions for the Riemann integral and are proved in the same way as these, using the definitions given in 5.1:

(5.3.1) $\int_S [g_1(x) + g_2(x)]\,dx = \int_S g_1(x)\,dx + \int_S g_2(x)\,dx,$

(5.3.2) $\int_S c\,g(x)\,dx = c \int_S g(x)\,dx,$

(5.3.3) $m L(S) \le \int_S g(x)\,dx \le M L(S),$

(5.3.4) $\int_{S_1 + S_2} g(x)\,dx = \int_{S_1} g(x)\,dx + \int_{S_2} g(x)\,dx,$

where $c$ is a constant, $m$ and $M$ denote the lower and upper bounds of $g(x)$ in $S$, while $S_1$ and $S_2$ are two sets without common points. (5.3.1) and (5.3.4) are immediately extended to an arbitrary finite number of terms.

¹) Even if the limit $g(x)$ of a sequence of functions integrable in the Riemann sense is bounded in an interval $(a, b)$, we cannot assert that the Riemann integral of $g(x)$ over $(a, b)$ exists. Consider, e.g., the sequence $g_1, g_2, \ldots$, where $g_n$ is equal to 1 for all rational numbers $x$ with a denominator $\le n$, and otherwise equal to 0. Obviously $g_n$ is integrable in the Riemann sense over (0, 1), but the limit of $g_n$ when $n \to \infty$ is the function $g(x)$ equal to 1 or 0 according as $x$ is rational or irrational, and we have seen in the preceding paragraph that the Riemann integral of this function over (0, 1) does not exist.
— If we consider the non-negative functions $|g(x)| \pm g(x)$, it follows from (5.3.3) that we have

(5.3.5) $\left| \int_S g(x)\,dx \right| \le \int_S |g(x)|\,dx.$

In the particular case when $g(x)$ is identically equal to 1, (5.3.3) gives

$$\int_S dx = L(S).$$

It further follows from (5.3.3) that the integral of any bounded $g(x)$ over a set of measure zero is always equal to zero. By means of (5.3.4) we then infer that, if $g_1(x)$ and $g_2(x)$ are equal for all $x$ in a set $S$, except for certain values of $x$ forming a subset of measure zero, then

$$\int_S g_1(x)\,dx = \int_S g_2(x)\,dx.$$

Thus if the values of the function to be integrated are arbitrarily changed on a subset of measure zero, this has no influence on the value of the integral. We may even allow the function to be completely undetermined on a subset of measure zero. We also see that, if two sets $S_1$ and $S_2$ differ by a set of measure zero, the integrals of any bounded $g(x)$ over $S_1$ and $S_2$ are equal. Hence follows in particular the truth of a statement made in 5.1, that the value of an integral over an interval is the same whether the interval is closed, open or half-open. It follows from the above that in the theory of the Lebesgue integral we may often neglect a set of measure zero. If a certain condition is satisfied for all $x$ belonging to some set $S$ under consideration, with the exception at most of certain values of $x$ forming a subset of measure zero, we shall say that the condition is satisfied almost everywhere in $S$, or for almost all values of $x$ belonging to $S$. We shall now prove an important theorem due to Lebesgue concerning the integral of the limit of a convergent sequence of functions. We shall say that a sequence $g_1(x), g_2(x), \ldots$ is uniformly bounded in the set $S$, if there is a constant $K$ such that $|g_\nu(x)| < K$ for all $\nu$ and for all $x$ in $S$. If the sequence $\{g_\nu(x)\}$ is uniformly bounded in $S$, and if

$$\lim_{\nu \to \infty} g_\nu(x) = g(x)$$

exists almost everywhere in $S$, we have

(5.3.6) $\lim_{\nu \to \infty} \int_S g_\nu(x)\,dx = \int_S g(x)\,dx.$
If $\lim g_\nu(x)$ does not exist for all $x$ in $S$, we complete the definition of $g(x)$ by putting $g(x) = 0$ for all $x$ such that the limit does not exist. We then have $|g(x)| \le K$ for all $x$ in $S$, and it follows from the preceding paragraph that $g(x)$ is B-measurable in $S$ and is thus integrable over $S$. Let now $\varepsilon > 0$ be given, and consider the set $S_n$ of all $x$ in $S$ such that $|g_\nu(x) - g(x)| \le \varepsilon$ for $\nu = n, n+1, \ldots$. Then $S_n$ is a Borel set, the sequence $S_1, S_2, \ldots$ is never decreasing, and the limiting set $\lim S_n$ (cf 1.5) contains every $x$ in $S$ such that $\lim g_\nu(x)$ exists. Thus by hypothesis $\lim S_n$ has the same measure as $S$, and we have by (4.7.1)

$$\lim L(S_n) = L(\lim S_n) = L(S).$$

We can thus choose $n$ such that $L(S_n) > L(S) - \varepsilon$, or $L(S - S_n) < \varepsilon$, and then obtain for all $\nu \ge n$

$$\int_S |g_\nu(x) - g(x)|\,dx = \int_{S_n} + \int_{S - S_n} < \varepsilon [L(S) + 2K].$$

Since $\varepsilon$ is arbitrary, and since

$$\left| \int_S g_\nu(x)\,dx - \int_S g(x)\,dx \right| \le \int_S |g_\nu(x) - g(x)|\,dx,$$

this proves our theorem. The theorem (5.3.6) can be stated in another form as a theorem on term-by-term integration of a series: If the series $\sum_1^\infty f_\nu(x)$ converges almost everywhere in $S$, and if the partial sums $\sum_1^n f_\nu(x)$ are uniformly bounded in $S$, then

(5.3.7) $\int_S \left( \sum_1^\infty f_\nu(x) \right) dx = \sum_1^\infty \int_S f_\nu(x)\,dx.$

Under this form, the theorem appears as a generalization of (5.3.1) to an infinite number of terms. We shall now show that a corresponding generalization of (5.3.4) may be deduced as a corollary from (5.3.7). If $S = S_1 + S_2 + \cdots$, where $S_\mu S_\nu = 0$ for $\mu \ne \nu$, then

(5.3.8) $\int_S g(x)\,dx = \sum_1^\infty \int_{S_\nu} g(x)\,dx.$

Let $e_\nu(x)$ denote a function equal to 1 for all $x$ in $S_\nu$ and otherwise equal to zero. For any $x$ belonging to $S$, we then have

$$g(x) = \sum_1^\infty e_\nu(x) g(x),$$

and it is obvious that the partial sums of this series are uniformly bounded in $S$. Then (5.3.7) gives

$$\int_S g(x)\,dx = \sum_1^\infty \int_S e_\nu(x) g(x)\,dx = \sum_1^\infty \int_{S_\nu} g(x)\,dx.$$

In the particular case $g(x) \equiv 1$, (5.3.8) reduces to the additivity relation (4.5.2) for the Lebesgue measure.

5.4. The integral of an unbounded function over a set of finite measure.
— In 5.1 and 5.2 we have seen that the Lebesgue integral $\int_S g(x)\,dx$ has a definite meaning under the two assumptions that 1) $g(x)$ is bounded in $S$, and 2) $S$ is of finite measure. We shall now try to remove these restrictions. In this paragraph, we consider the case when $S$ is still of finite measure, but $g(x)$ is not necessarily bounded in $S$. Let $a$ and $b$ be any numbers such that $a < b$, and put

$$g_{a,b}(x) = \begin{cases} a & \text{if } g(x) < a, \\ g(x) & \text{if } a \le g(x) \le b, \\ b & \text{if } g(x) > b. \end{cases}$$

Obviously $g_{a,b}(x)$ is bounded and B-measurable in $S$, and thus integrable over $S$. If the limit
We proceed to the generalization of (5.3.1), which is a little more difficult. Suppose that f[x) and g{x) are both integrable over S. From 1/ + i? |a. 0 0, \f g k b ^ l/lo, ft + k b, it follows that f(x) -f- g[x) is also integrable. We have to show that the property (5.3.1) holds in the present case, i.e. that (^^*•^■3) I (/ -h g)dx f fdx 4* j gdx. Suppose in the first place that f and g are both non-negative in S. Then 4,3 5.4 (/ + 9)a, 0 ~ fa, 0 — Oa,i3 — (/ + 9 ) 0 , h ^ fo. h + go, h ^ (/ r/)o, 2 fc, and hence jif+ff)a.hdx^jfa,hdx + jg„,i,dx^ j (/+ !>}a, if' d.r. s s s s Allowing a and h to tend to their respective limits^ we obtain (5.4,3). — Now S may be divided into at most six subsets, no two of which have a common point, such that in each subset none of the three functions f g and f+g changes its sign. For each subset, (5.4.3) is proved by the above argument. Adding the results and using (5.3.4} we obtain (5.4.3) for the general case. We have thus shown that all the properties (5.3.1) — (5.3.5) of the integral hold true in the present case. In order to generalize also the properties expressed by the relations (5.3.6) — (5.3.8), we shall first prove the following lemma: If g[x) is intrgrahle over Sq, and if e >0 is given, tre can always find d > 0 such that (5.4.4) I j g{x)dx \ < t s for every subset S C S^y n hich satisfies the condition L {S) < d. Since we have seen that (5.3.5) holds in the present case, it is sufficient to prove the lemma for a non-negative function (/(.r). In that case f g d X — lim | ^ 0 , u d x, >0 ' and thus we can find h such that 0 S J (g — go. b) d X < I s. Since the integrand is non-negative, it follows by means of (5.3.4) and (5.3.3) that we have for any subset S c S^^ or I (9 ~ f/o, b) d X < 4 e s I gdx < f .^0. bdx i B > b L (S) J f . .N ’S Choosing the truth of the lemma follows immediately. 
44 5.4-5 A consequence of the lemma is that, if g(x) is integrable over an X interval (a, b), the integral f is a continuous function of x for a a < X < b. We can now proceed to the generalization of (5.3.6). Assuming that lim = ff(x} almost everywhere in S, we shall show that the 00 relation (5.4.5) lim f fff>(x)dx= f g(x)dx holds if the sequence {gv[x)\ is uniformly dominated by an integrable function, i. e. if \gv(x)\< G (a:) for all v and for all x in S, where G(x) is integrable over S. — In the particular case G (a;) = const., this re- duces to (5.3.6). The proof is quite similar to the proof of (5.3.6). We first observe that it follows from the hypothesis that | ^ (a;) | ^ G (a;) almost every- where in S; thus gr(x) and g{x) are integrable over S, Given £>0, we then denote by Sn the set of all a; in 5 such that |^t.(a?) — fl^(a:)l ^ e for all v^n. Then 5^, . . . is a never decreasing sequence, and L{Sn) L(S). Using lemma (5.4.4), we now determine d such that f G(x)dx < £ for every S' < S with L (S') < d, and then choose n such S' that L(S„) > L{S)— d, and consequently L(S— Sn) < d. We then ob- tain for all V ^ n f\ffAx) — <J W \dx = f + f .S' .s„ «-.s„ <sL(S) + 2f G(x)dy<«[L(S) + 2], and thus (5.4.5) is proved. — The corresponding generalization of (5.3.7) and (5.3.8) is immediate. 5.5. The integral over a set of infinite measure. — We shall now reniove also the second restriction mentioned at the beginning of 5.4, and consider Lebesgue integrals over sets of infinite measure. Let S be a Bore! set of infinite measure, and denote by Sa^h the product (common part) of S with the closed interval [a, fc), where a and b are finite. Then Sa,b is, of course, of finite measure. 
If g(x) is integrable over S_{a,b} for all a and b, and if the limit

  lim_{a→−∞, b→+∞} ∫_{S_{a,b}} |g(x)| dx = ∫_S |g(x)| dx

exists and has a finite value, we shall say that g(x) is integrable over S.¹) It is easily seen that in this case the limit

(5.5.1)  lim_{a→−∞, b→+∞} ∫_{S_{a,b}} g(x) dx = ∫_S g(x) dx

also exists and has a finite value, and we shall accordingly say that the Lebesgue integral of g(x) over the set S is convergent. The limit (5.5.1) is then, by definition, the value of this integral. If g(x) is integrable over S, it is also integrable over any subset of S.

If |g(x)| ≤ G(x) for all x in S, where G(x) is integrable over S, it is easily seen that g(x) is integrable over S. Since |g₁ + g₂| ≤ |g₁| + |g₂|, it follows that the sum of two integrable functions is itself integrable.

It follows directly from the definition that the properties (5.3.1), (5.3.2) and (5.3.4) hold true in the case of functions integrable over a set of infinite measure. Instead of (5.3.3), we obtain here only the inequality

  ∫_S g(x) dx ≥ 0  if g(x) ≥ 0 for all x in S.

This is, however, sufficient for the deduction of (5.3.5) for any integrable g(x).

We now proceed to the generalization of (5.4.5), which is itself a generalization of (5.3.6). If lim g_ν(x) = g(x) almost everywhere in S, and if |g_ν| < G, where G is integrable over S, it follows as in the preceding paragraph that |g| ≤ G almost everywhere in S. Consequently g(x) is integrable over S, and we can choose a and b such that for all ν

  ∫_{S − S_{a,b}} |g_ν − g| dx ≤ 2 ∫_{S − S_{a,b}} G(x) dx < ε/2.

Now S_{a,b} is of finite measure, and it then follows from the proof of (5.4.5) that we can choose n such that for all ν ≥ n

  ∫_{S_{a,b}} |g_ν − g| dx < ε/2.

¹) Strictly speaking, we ought to say that g(x) is absolutely integrable over S, and that the integral of g(x) over S is absolutely convergent. As we shall only in exceptional cases use non-absolutely convergent integrals we may, however, without inconvenience use the simpler terminology adopted in the text.
We then have for ν ≥ n

  ∫_S |g_ν − g| dx = ∫_{S_{a,b}} + ∫_{S − S_{a,b}} < ε.

Since ε is arbitrary, we have thus proved the following theorem, which contains (5.3.6) and (5.4.5) as particular cases:

If lim_{ν→∞} g_ν(x) = g(x) exists almost everywhere in the set S of finite or infinite measure, and if |g_ν(x)| < G(x) for all ν and for all x in S, where G(x) is integrable over S, then g(x) is integrable over S, and

(5.5.2)  lim_{ν→∞} ∫_S g_ν(x) dx = ∫_S g(x) dx.

The theorem (5.5.2) may, of course, also be stated as a theorem on term-by-term integration of series analogous to (5.3.7). Finally, the argument used for the proof of (5.3.8) evidently applies in the present case and leads to the following generalized form of that theorem:

If g(x) is integrable over S, and if S = S₁ + S₂ + ⋯, where S_μ S_ν = 0 for μ ≠ ν, then

(5.5.3)  ∫_S g(x) dx = Σ_{ν=1}^∞ ∫_{S_ν} g(x) dx.

5.6. The Lebesgue integral as an additive set function. Let us consider a fixed non-negative function f(x), integrable over any finite interval, and put for any Borel set S

(5.6.1)  P(S) = ∫_S f(x) dx, if f(x) is integrable over S;  P(S) = +∞ otherwise.

Then P(S) is a non-negative function of the set S, uniquely defined for all Borel sets S. Let now S = S₁ + S₂ + ⋯, where S_μ S_ν = 0 for μ ≠ ν. It then follows from (5.5.3) that the additivity relation

  P(S) = P(S₁) + P(S₂) + ⋯

holds as soon as P(S) is finite. The same relation holds, however, even if P(S) is infinite. For if this were not true, it would be possible to choose the sets S and S₁, S₂, ... such that P(S) = +∞, while the sum P(S₁) + P(S₂) + ⋯ would be finite. This would, however, imply the relation

  ∫_{S_{a,b}} f(x) dx = Σ_ν ∫_{(S_ν)_{a,b}} f(x) dx ≤ Σ_ν P(S_ν).

Allowing here a and b to tend to their respective limits, it follows that f(x) would be integrable over S, against our hypothesis. Thus P(S) as defined by (5.6.1) is a non-negative and additive set function, defined for all Borel sets S in R₁.
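The additivity just proved can be sketched numerically; the particular choice f(x) = e^(−x) on S = [0, ∞) and the partition into unit intervals below are our own example, not from the text:

```python
import math

# Sketch of the additivity of P(S) = integral of f over S for the choice
# f(x) = exp(-x) on S = [0, oo), split into disjoint pieces S_k = [k, k+1).
# By (5.5.3) the values P(S_k) must sum to P(S) = 1.
def P(a, b):
    """P of the interval (a, b) for f(x) = exp(-x), in closed form."""
    return math.exp(-a) - math.exp(-b)

pieces = [P(k, k + 1) for k in range(50)]
partial = sum(pieces)      # partial sums of the series tend to P(S) = 1
```

Each piece equals e^(−k)(1 − e^(−1)), and the fifty pieces already account for all but about e^(−50) of the total mass.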
In the particular case when f(x) = 1, we have P(S) = L(S), so that P(S) is identical with the Lebesgue measure of the set S. Another important particular case arises when f(x) is integrable over the whole space R₁. In this case, P(S) is always finite, and we have for any Borel set S

  P(S) ≤ ∫_{−∞}^{+∞} f(x) dx.

CHAPTER 6.

Non-Negative Additive Set Functions in R₁.

6.1. Generalization of the Lebesgue measure and the Lebesgue integral. In Ch. 4 we have determined the Lebesgue measure L(S) for any Borel set S. L(S) is a number associated with S or, as we have expressed it, a function of the set S. We have seen that this set function satisfies the three conditions of 4.2, which require that L(S) should be a) non-negative, b) additive, and c) for any interval equal to the length of the interval. We have finally seen that L(S) is the only set function satisfying the three conditions.

On the other hand, if we omit the condition c), L(S) will no longer be the only set function satisfying our conditions. Thus e.g. the function P(S) defined by (5.6.1) satisfies the conditions a) and b), while c) is only satisfied in the particular case f(x) = 1, when P(S) = L(S). Another example is obtained in the following way. Let x₁, x₂, ... be a sequence of points, and p₁, p₂, ... a sequence of positive quantities. Then let us put for any Borel set S

  P(S) = Σ_{x_ν ∈ S} p_ν,

the sum being extended to all x_ν belonging to S. It is readily seen that the set function P(S) thus defined satisfies the conditions a) and b), but not c).

We are thus led to the general concept of a non-negative and additive set function, as a natural generalization of the Lebesgue measure L(S). In the present chapter we shall first, in the paragraphs 6.2-6.4, investigate some general properties of functions of this type. In the applications to probability theory and statistics that will be made later in this book, a fundamental part is played by a particular class of non-negative and additive set functions.
This class will be considered in the paragraphs 6.5-6.8.

In the following Chapter 7, we shall then proceed to show that the whole theory of the Lebesgue integral may be generalized by replacing, in the basic definition (5.1.1) of the Darboux sums, the Lebesgue measure L(S) by a general non-negative and additive set function P(S). The generalized integral obtained in this way, which is known as the Lebesgue-Stieltjes integral, will also be of a fundamental importance for the applications.

6.2. Set functions and point functions. We shall consider a set function P(S) defined for all Borel sets S and satisfying the following three conditions:

A) P(S) is non-negative: P(S) ≥ 0.

B) P(S) is additive: P(S₁ + S₂ + ⋯) = P(S₁) + P(S₂) + ⋯  (S_μ S_ν = 0 for μ ≠ ν).

C) P(S) is finite for any bounded set S.

All set functions considered in the sequel will be assumed to satisfy these conditions.

From the conditions A) and B), which are the same as in the particular case of the Lebesgue measure L(S), we directly obtain certain properties of P(S), which are proved in the same way as the corresponding properties of L(S). Thus if S₁ ⊂ S₂ we have

(6.2.1)  P(S₁) ≤ P(S₂).

For the empty set we have P(0) = 0. If S₁, S₂, ... are sets which may or may not have common points, we have (cf 4.3.7, which obviously holds for any Borel sets)

(6.2.2)  P(S₁ + S₂ + ⋯) ≤ P(S₁) + P(S₂) + ⋯.

For a non-decreasing sequence S₁, S₂, ..., we have (cf 4.7.1)

(6.2.3)  lim P(S_n) = P(lim S_n).

For a non-increasing sequence, the same relation holds provided that P(S₁) is finite.

When a set S consists of all points ξ that satisfy a certain relation, we shall often denote the value P(S) simply by replacing the sign S within the brackets by the relation in question. Thus e.g. if S is the closed interval (a, b), we shall write P(S) = P(a ≤ ξ ≤ b). When S is the set consisting of the single point ξ = a, we shall write P(S) = P(ξ = a), and similarly in other cases.
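The conditions A)-B) can be made concrete with a small sketch of the discrete set function of 6.1; the mass points and weights below are our own example:

```python
# The purely discrete set function P(S) = sum of the p_nu over the points
# x_nu lying in S.  Mass points and weights are our own example.
masses = {0.0: 0.5, 1.0: 0.25, 2.5: 0.25}   # x_nu -> p_nu

def P(pred):
    """P(S) for S = {x : pred(x)}; only the mass points can contribute."""
    return sum(p for x, p in masses.items() if pred(x))

# A) non-negative, and B) additive over the disjoint sets {x <= 1}, {x > 1}:
p1 = P(lambda x: x <= 1)        # 0.75
p2 = P(lambda x: x > 1)         # 0.25
whole = P(lambda x: True)       # 1.0 = p1 + p2
# condition c) of 4.2 fails: the open interval (0, 1) has length 1,
# yet it carries no mass at all.
zero = P(lambda x: 0 < x < 1)
```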
We have called P(S) a set function, since the argument of this function is a set. For an ordinary function of one or more variables, the argument may be considered as a point with the coordinates x₁, ..., x_n, and we shall accordingly often refer to such a function as a point function. When a set function P(S) and a constant k are given, we define a corresponding point function F(x; k) by putting

(6.2.4)  F(x; k) = P(k < ξ ≤ x)  for x > k,
         F(x; k) = 0             for x = k,
         F(x; k) = −P(x < ξ ≤ k) for x < k.

Whatever the value of the constant parameter k, we then find for any finite interval (a, b)

  F(b; k) − F(a; k) = P(a < ξ ≤ b) ≥ 0,

which shows that F(x; k) is a non-decreasing function of x. If in the last relation we allow a to tend to −∞, or b to tend to +∞, or both, it follows from (6.2.3) that the same relation holds also for infinite intervals. In the particular case when P(S) is the Lebesgue measure L(S), we have F(x; k) = x − k.

The functions F(x; k) corresponding to two different values of the parameter k differ by a quantity independent of x. In fact, if k₁ < k₂ we obtain

  F(x; k₁) − F(x; k₂) = P(k₁ < ξ ≤ k₂).

Thus if we choose an arbitrary value k₀ of k and denote the corresponding function F(x; k₀) simply by F(x), any other F(x; k) will be of the form F(x) + const.

We may thus say that to any set function P(S) satisfying the conditions A)-C), there corresponds a non-decreasing point function F(x) such that for any finite or infinite interval (a, b) we have

(6.2.5)  F(b) − F(a) = P(a < ξ ≤ b).

F(x) is uniquely determined except for an additive constant.

We now choose an arbitrary, but fixed value of the parameter, and consider the corresponding function F(x). Since F(x) is non-decreasing, the two limits from above and from below

  F(a + 0) = lim_{x→a+0} F(x),  F(a − 0) = lim_{x→a−0} F(x)

exist for all values of a, and F(a − 0) ≤ F(a + 0). According to (6.2.5) we have for x > a

  F(x) − F(a) = P(a < ξ ≤ x).
Consider this relation for a decreasing sequence of values of x tending to the fixed value a. The corresponding half-open intervals a < ξ ≤ x form a decreasing sequence of sets, the limiting set of which is empty. Thus by (6.2.3) we have F(x) − F(a) → 0, i.e. F(a + 0) = F(a). On the other hand, for x < a

  F(a) − F(x) = P(x < ξ ≤ a),

and a similar argument shows that

  F(a − 0) = F(a) − P(ξ = a) ≤ F(a).

Thus the function F(x) is always continuous to the right. For every value of x such that P(ξ = x) > 0, F(x) has a discontinuity with the saltus P(ξ = x). For every value of x such that P(ξ = x) = 0, F(x) is continuous.

Any x such that P(S) takes a positive value for the set S consisting of the single point x is thus a discontinuity point of F(x). These points are called discontinuity points also for the set function P(S), and any continuity point of F(x) is also called a continuity point of P(S).

The discontinuity points of P(S) and F(x) form at most an enumerable set. Consider, in fact, the discontinuity points x belonging to the interval i_n defined by n < x ≤ n + 1 and such that P(ξ = x) > 1/c. Let S_ν be a set consisting of any ν of these points, say x₁, ..., x_ν. Since S_ν is a subset of the interval i_n, we then obtain

  P(i_n) ≥ P(S_ν) = P(ξ = x₁) + ⋯ + P(ξ = x_ν) > ν/c,

or ν < c P(i_n). Thus there can at most be a finite number of such points x, and if we allow c to assume the values c = 1, 2, ..., we find that the discontinuity points in i_n form at most an enumerable set. Summing over n = 0, ±1, ±2, ..., we obtain (cf 1.4) the proposition stated.

Let now x₁, x₂, ... be all discontinuity points of P(S) and F(x), let X denote the set of all the points x_ν, and put P(ξ = x_ν) = p_ν. For any set S, the product set SX consists of all the points x_ν belonging to S, while the set S − SX = SX* contains all the remaining points of S. We now define two new set functions P₁ and P₂ by writing

(6.2.6)  P₁(S) = P(SX) = Σ_{x_ν ∈ S} p_ν,  P₂(S) = P(SX*).

It is then immediately seen that P₁ and P₂ both satisfy our conditions A)-C). Further, we have S = SX + SX*, and hence

(6.2.7)  P(S) = P₁(S) + P₂(S).

It follows from (6.2.6) that P₁(S) is the sum of the saltuses p_ν for all discontinuities x_ν belonging to S. Thus P₁(S) = 0 for a set S which does not contain any x_ν. On the other hand, (6.2.6) shows that P₂(S) is everywhere continuous, since all points belonging to X* are continuity points of P(S). Thus (6.2.7) gives a decomposition of the non-negative and additive set function P(S) in a discontinuous part P₁(S) and a continuous part P₂(S).

If F, F₁ and F₂ are the non-decreasing point functions corresponding to P, P₁ and P₂, and if we choose the same value of the additive constant k in all three cases, we obtain from (6.2.4) and (6.2.7)

(6.2.8)  F(x) = F₁(x) + F₂(x).

Here F₂ is everywhere continuous, while F₁ is a "step-function", which is constant over every interval free from the points x_ν, but has a "step" of the height p_ν in every x_ν. It is easily seen that any non-decreasing function F(x) may be represented in the form (6.2.8), as the sum of a step-function and an everywhere continuous function, both non-decreasing and uniquely determined.

6.3. Construction of a set function. We shall now prove the following converse of theorem (6.2.5):

To any non-decreasing point function F(x), that is finite for all finite x and is always continuous to the right, there corresponds a set function P(S), uniquely determined for all Borel sets S and satisfying the conditions A)-C) of 6.2, in such a way that the relation

  F(b) − F(a) = P(a < ξ ≤ b)

holds for any finite or infinite interval (a, b). It is then evident that two functions F₁(x) and F₂(x) yield the same P(S) if and only if the difference F₁ − F₂ is constant.
Comparing this with theorem (6.2.5) we find that, if two functions F₁ and F₂ differing by a constant are counted as identical, there is a one-to-one correspondence between the set functions P(S) and the non-decreasing point functions F(x).

In the first place, the non-decreasing point function F(x) determines a non-negative interval function P(i), which may be defined as the increase of F(x) over the interval i. For any half-open interval i defined by a < x ≤ b, P(i) assumes the value P(a < x ≤ b) = F(b) − F(a). For the three other types of intervals with the same end-points a and b we determine the value of P(i) by a simple limit process and thus obtain

(6.3.1)  P(a ≤ x ≤ b) = F(b) − F(a − 0),
         P(a < x < b) = F(b − 0) − F(a),
         P(a < x ≤ b) = F(b) − F(a),
         P(a ≤ x < b) = F(b − 0) − F(a − 0),

so that P(i) is completely determined for any interval i.

The theorem to be proved asserts that it is possible to find a non-negative and additive set function, defined for all Borel sets S, and equal to P(i) in the particular case when S is an interval i. This is, however, a straightforward generalization of the problem treated in Ch. 4. In that chapter, we have been concerned with the particular case F(x) = x, and with the corresponding interval function: the length L(i) of an interval i. The whole theory of Lebesgue measure as developed in Ch. 4 consists in the construction of a non-negative and additive set function, defined for all Borel sets S and equal to L(i) in the particular case when S is an interval i. It is now required to perform the analogous construction in the case when the length or "L-measure" of an interval, L(i) = b − a, has been replaced by the more general "P-measure" P(i) defined by (6.3.1).

Now this may be done by exactly the same method as we have applied to the particular case treated in Ch. 4. With two minor exceptions to be discussed below, every word and every formula of Ch. 4 will hold good, if 1) the words measure and measurable are throughout replaced by P-measure and P-measurable, 2) the length L(i) = b − a of an interval is replaced by the P-measure P(i), and 3) the sign L is everywhere replaced by P. In this way, strictly following the model set out in 4.1-4.5, we establish the existence of a non-negative and additive set function P(S), uniquely defined for a certain class of sets that are called P-measurable, and equal to P(i) when S is an interval i. Further, it is shown exactly as in 4.6 that the class of all P-measurable sets is an additive class and thus contains all Borel sets. Finally, we prove in the same way as in 4.7 that P(S) is the only non-negative and additive set function defined for all Borel sets which reduces to the interval function P(i) when S is an interval. In this way, our theorem is proved.

Moreover, the proof explains why it will be advantageous to restrict ourselves throughout to the consideration of Borel sets. We find, in fact, that although the class of all P-measurable sets may depend on the particular function F(x) which forms our starting point, it always contains the whole class of Borel sets. Thus any Borel set is always P-measurable, and the set function P(S) corresponding to any given F(x) can always be defined for all Borel sets.

It now only remains to consider the two exceptional points in Ch. 4 referred to above. The first point is very simple, and is not directly concerned with the proof of the above theorem. In 4.3 we have proved that the Lebesgue measure of an enumerable set is always equal to zero. This follows from the fact that an enumerable set may be considered as the sum of a sequence of degenerate intervals, each of which has the length zero. The corresponding proposition for P-measure is obviously false, as soon as the function F(x) has at least one discontinuity point.
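The four interval measures (6.3.1), and the behaviour at a discontinuity point, can be sketched for a right-continuous step function of our own choosing:

```python
# The interval measures (6.3.1), computed from F and its left limits
# F(x - 0), for a right-continuous step function of our own choosing.
JUMPS = {1.0: 0.5, 2.0: 0.5}         # F jumps by 0.5 at x = 1 and at x = 2

def F(x):
    return sum(p for t, p in JUMPS.items() if t <= x)

def F_left(x):                        # the left limit F(x - 0)
    return sum(p for t, p in JUMPS.items() if t < x)

closed  = F(2.0) - F_left(1.0)        # P(1 <= x <= 2) = F(b) - F(a - 0)
opened  = F_left(2.0) - F(1.0)        # P(1 <  x <  2) = F(b - 0) - F(a)
half_ab = F(2.0) - F(1.0)             # P(1 <  x <= 2) = F(b) - F(a)
half_ba = F_left(2.0) - F_left(1.0)   # P(1 <= x <  2) = F(b - 0) - F(a - 0)
point   = F(1.0) - F_left(1.0)        # degenerate interval: P(x = 1)
```

The closed interval picks up both jumps, the open interval neither, each half-open interval exactly one, and the degenerate interval [1, 1] carries the positive measure 0.5.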
A degenerate interval consisting of the single point a may then well have a positive P-measure, since the first relation (6.3.1) gives

  P(x = a) = F(a) − F(a − 0).

As soon as an enumerable set contains at least one discontinuity point of F(x), it has thus a positive P-measure.

The second exceptional point arises in connection with the generalization of paragraph 4.1, where we have proved that the length is an additive interval function. In order to prove the same proposition for P-measure, we have to show that

(6.3.2)  P(i) = P(i₁) + P(i₂) + ⋯,

where i and i₁, i₂, ... are intervals such that i = i₁ + i₂ + ⋯ and i_μ i_ν = 0 for μ ≠ ν.

For a continuous F(x), this is shown by Borel's lemma exactly in the same way as in the case of the corresponding relation (4.1.1), replacing throughout length by P-measure. Let us, however, note that in the course of the proof of (4.1.1) we have considered certain intervals, e.g. the interval (a − ε, a + ε), which is chosen so as to make its length equal to 2ε. When generalizing this proof to P-measure, we should replace this interval by (a − h, a + h), choosing h such that the P-measure F(a + h) − F(a − h) becomes smaller than 2ε.

On the other hand, if F(x) is a step-function possessing in i the discontinuity points x₁, x₂, ... with the respective steps p₁, p₂, ..., we have

  P(i) = Σ_ν p_ν,  P(i_n) = Σ_{x_ν ∈ i_n} p_ν.

Since no two of the i_n have a common point, every x_ν belongs to exactly one i_n, and it then follows from the properties of convergent double series that (6.3.2) is satisfied.

Finally, by the remark made in connection with (6.2.8), any F(x) is the sum of a step-function F₁ and a continuous component F₂, both non-decreasing. For both these functions, (6.3.2) holds, and thus the same relation also holds for their sum F(x). We have thus dealt with the two exceptional points arising in the course of the generalization of Ch. 4 to an arbitrary P-measure, and the proof of our theorem is hereby completed.

6.4. P-measure. A set function P(S) satisfying the conditions A)-C) of 6.2 defines a P-measure of the set S, which constitutes a generalization of the Lebesgue measure L(S). Like the latter, the P-measure is non-negative and additive.

By the preceding paragraph, the P-measure is uniquely determined for any Borel set S, if the corresponding non-decreasing point function F(x) is known. Since, by 6.2, F(x) is always continuous to the right, it is sufficient to know F(x) in all its points of continuity.

If, for a set S, we have P(S) = 0, we shall say that S is a set of P-measure zero. By (6.2.1), any subset of S is then also of P-measure zero. The sum of a sequence of sets of P-measure zero is, by (6.2.2), itself of P-measure zero. If F(a) = F(b), the half-open interval a < x ≤ b is of P-measure zero.

When a certain condition is satisfied for all points belonging to some set S under consideration, except possibly certain points forming a subset of P-measure zero, we shall say (cf 5.3) that the condition is satisfied almost everywhere (P), or for almost all (P) points in the set S.

6.5. Bounded set functions. For any Borel set S we have by (6.2.1) P(S) ≤ P(R₁). If P(R₁) is finite, we shall say that the set function P(S) is bounded. When P(S) is bounded, we shall always fix the additive constant in the corresponding non-decreasing point function F(x) by taking k = −∞ in (6.2.4), so that we have for all values of x

(6.5.1)  F(x) = P(ξ ≤ x).

When x tends to −∞ in this relation, the set of all points ξ ≤ x tends to a limit (cf 1.5), which is the empty set. Thus by (6.2.3) we have F(−∞) = 0. On the other hand, when x → +∞, the set ξ ≤ x tends to the whole space R₁, and (6.2.3) now gives F(+∞) = P(R₁). Since F(x) is non-decreasing, we thus have for all x

(6.5.2)  0 ≤ F(x) ≤ P(R₁).

6.6. Distributions. Non-negative and additive set functions P(S) such that P(R₁) = 1 play a fundamental part in the applications to mathematical probability and statistics.
A function P(S) belonging to this class is obviously bounded, and the corresponding non-decreasing point function F(x) is defined by (6.5.1), so that

(6.6.1)  F(x) = P(ξ ≤ x),  0 ≤ F(x) ≤ 1,  F(−∞) = 0,  F(+∞) = 1.

A pair of functions P(S) and F(x) of this type will often be concretely interpreted by means of a distribution of mass over the one-dimensional space R₁. Let us imagine a unit of mass distributed over R₁ in such a way that for every x the quantity of mass allotted to the infinite interval ξ ≤ x is equal to F(x). The construction of a set function P(S) by means of a given point function F(x), as explained in 6.3, may then be interpreted by saying that any Borel set S will carry a determined mass quantity P(S). The total quantity of mass on the whole line is P(R₁) = 1.

We are at liberty to define such a distribution either by the set function P(S) or by the corresponding point function F(x). Using a terminology adapted to the applications of these concepts that will be made in the sequel, we shall call P(S) the probability function of the distribution, while F(x) will be called the distribution function.

Thus a distribution function is a non-decreasing point function F(x) which is everywhere continuous to the right and is such that F(−∞) = 0 and F(+∞) = 1. Conversely, it follows from 6.3 that any given F(x) with these properties determines a unique distribution, having F(x) for its distribution function.

If x₀ is a discontinuity point of F(x), with a saltus equal to p₀, the mass p₀ will be concentrated in the point x₀, which is then called a discrete mass point of the distribution. On the other hand, if x₀ is a continuity point, the quantity of mass situated in the interval (x₀ − h, x₀ + h) will tend to zero with h. The ratio

  (F(x + h) − F(x − h)) / (2h)

is the mean density of the mass belonging to the interval x − h < ξ ≤ x + h.
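Both notions, the saltus at a discrete mass point and the mean density at a continuity point, can be sketched for a simple mixture of our own choosing: a discrete mass at x = 0 and a uniformly distributed continuous part, in the spirit of the decomposition (6.2.8).

```python
# A distribution with a discrete and a continuous part (our own example):
# mass c1 = 0.4 at the point 0, and c2 = 0.6 spread uniformly over (0, 1).
c1, c2 = 0.4, 0.6

def F1(x):                       # step part: unit mass at x = 0
    return 1.0 if x >= 0 else 0.0

def F2(x):                       # continuous part: uniform on (0, 1)
    return min(max(x, 0.0), 1.0)

def F(x):                        # distribution function of the mixture
    return c1 * F1(x) + c2 * F2(x)

saltus = F(0.0) - F(-1e-12)      # approximates F(0) - F(0 - 0) = c1
h = 1e-6                          # mean density near the continuity point 0.5
mean_density = (F(0.5 + h) - F(0.5 - h)) / (2 * h)   # close to c2 * 1 = 0.6
```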
If the derivative F′(x) = f(x) exists, the mean density tends to f(x) as h tends to zero, and accordingly f(x) represents the density of mass at the point x. In the applications to probability theory, f(x) will be called the probability density or the frequency function of the distribution. Any frequency function f(x) is non-negative and has the integral 1 over (−∞, ∞).

From (6.2.7) and (6.2.8) it follows that any distribution may be decomposed into a discontinuous and a continuous part by writing

  P(S) = c₁ P₁(S) + c₂ P₂(S),  F(x) = c₁ F₁(x) + c₂ F₂(x).

Here c₁ and c₂ are non-negative constants such that c₁ + c₂ = 1. P₁ and F₁ denote the probability function and distribution function of a distribution, the total mass of which is concentrated in discrete mass points (thus F₁ is a step-function). P₂ and F₂, on the other hand, correspond to a distribution without any discrete mass points (thus F₂ is everywhere continuous). The constants c₁ and c₂, as well as the functions P₁, P₂, F₁ and F₂, are uniquely determined by the given distribution.

In the extreme case when c₁ = 1, c₂ = 0, the distribution function F(x) is a step-function, and the whole mass of the distribution is concentrated in the discontinuity points of F(x), each of which carries a mass quantity equal to the corresponding saltus. The opposite extreme is characterized by c₁ = 0, c₂ = 1, when F(x) is everywhere continuous, and there is no single point carrying a positive quantity of mass.

In Ch. 15 we shall give a detailed treatment of the general theory of distributions in R₁. In the subsequent Chs. 16-19, certain important special distributions will be discussed and illustrated by figures. At the present stage, the reader may find it instructive to consult Figs 4-5 (p. 169), which correspond to the case c₁ = 1, c₂ = 0, and Figs 6-7 (p. 170-171), which correspond to the case c₁ = 0, c₂ = 1.

6.7. Sequences of distributions. An interval (a, b) will be called a continuity interval for a given non-negative and additive set function P(S), and for the corresponding point function F(x), when both extremes¹) a and b are continuity points (cf 6.2) of P(S) and F(x). If two set functions agree for all intervals that are continuity intervals for both, it is easily seen that the corresponding point functions F(x) differ by a constant, so that the set functions are identical.

Consider now a sequence of distributions, with the probability functions P₁(S), P₂(S), ... and the distribution functions F₁(x), F₂(x), .... We shall say that the sequence is convergent, if there is a non-negative and additive set function P(S) such that P_n(S) → P(S) whenever S is a continuity interval for P(S).

Since we always have 0 ≤ P_n(S) ≤ 1, it follows that for a convergent sequence we have 0 ≤ P(S) ≤ 1 for any continuity interval S = (a, b). When a → −∞ and b → +∞, it then follows from (6.2.3) that P(R₁) ≤ 1. The case when P(R₁) = 1 is of special interest. In this case P(S) is the probability function of a certain distribution, and we shall accordingly say that our sequence converges to a distribution, viz. to the distribution corresponding to P(S). Usually it is only this mode of convergence that is interesting in the applications, and we shall often want a criterion that will enable us to decide whether a given sequence of distributions converges to a distribution or not. The important problem of finding such a criterion will be solved later (cf 10.4); for the present we shall only give the following preliminary proposition:

A sequence of distributions with the distribution functions F₁(x), F₂(x), ... converges to a distribution when and only when there is a distribution function F(x) such that F_n(x) → F(x) in every continuity point of F(x).

¹) Note that any inner point of the interval may be a discontinuity. The name of continuity-bordered interval, though longer, would perhaps be more adequate.
When such a function F(x) exists, F(x) is the distribution function corresponding to the limiting distribution of the sequence, and we shall briefly say that the sequence converges to the distribution function F(x).

We shall first show that the condition is necessary, and that the limit F(x) is the distribution function of the limiting distribution. Denoting as usual by P_n(S) the probability function corresponding to F_n(x), we thus assume that P_n(S) tends to a probability function P(S) whenever S is a continuity interval (a, b) for P(S). Denoting by F(x) the distribution function corresponding to P(S), we have to show that F_n(x) → F(x), where x is an arbitrary continuity point of F(x).

Since P(R₁) = 1, we can choose a continuity interval S = (a, b) including x such that P(S) > 1 − ε, where ε > 0 is arbitrarily small. Then

  1 − ε < P(S) = F(b) − F(a) ≤ 1 − F(a),

so that 0 ≤ F(a) < ε. Further, we have by hypothesis F_n(b) − F_n(a) → F(b) − F(a) > 1 − ε, so that for all sufficiently large n we have F_n(b) − F_n(a) > 1 − 2ε, or

  0 ≤ F_n(a) ≤ F_n(b) − 1 + 2ε ≤ 2ε.

Since (a, x) is a continuity interval for P(S), we have by hypothesis F_n(x) − F_n(a) → F(x) − F(a). For all sufficiently large n we thus have

  |F_n(x) − F_n(a) − F(x) + F(a)| < ε,

and hence according to the above |F_n(x) − F(x)| < 3ε. Since ε is arbitrary, it follows that F_n(x) → F(x).

Conversely, if we assume that F_n(x) tends to a distribution function F(x) in every continuity point of F(x), and if we denote by P(S) the probability function corresponding to F(x), it immediately follows that F_n(b) − F_n(a) → F(b) − F(a), i.e. that P_n(S) → P(S), whenever S is a half-open continuity interval a < x ≤ b for P(S). Further, since F(x) is never decreasing, and continuous for x = a and x = b, it follows that F_n(a − 0) → F(a) and F_n(b − 0) → F(b). Hence we obtain the same relation P_n(S) → P(S) whether the continuity interval S = (a, b) is regarded as closed, open or half-open. Thus the proposition is proved.

In order to show by an example that a sequence of distributions may converge without converging to a distribution, we consider first the distribution which has the whole mass unit placed in the single point x = 0. Denoting the corresponding distribution function by ε(x), we have

(6.7.1)  ε(x) = 0 for x < 0,  ε(x) = 1 for x ≥ 0.

Then ε(x − a) is the distribution function of a distribution which has the whole mass unit placed in the point x = a. Consider now the sequence of distributions defined by the distribution functions F_n(x) = ε(x − n), where n = 1, 2, .... Obviously this sequence is convergent according to the above definition, since the mass contained in any finite interval tends to zero as n → ∞. The limiting set function is, however, identically equal to zero, and is thus not a probability function. When n → ∞, the mass in our distributions disappears, as it were, towards +∞.

It might perhaps be asked why, in our convergence definition, we should not require that P_n(S) → P(S) for every Borel set S. It is, however, easily shown that this would be a too restrictive definition. Consider, in fact, the sequence of distributions defined by the distribution functions ε(x − 1/n), where n = 1, 2, .... The n-th distribution in this sequence has its whole mass unit placed in the point x = 1/n. It is evident that any reasonable convergence definition must be such that this sequence converges to the distribution defined by ε(x), where the whole mass unit is placed in x = 0. It is easily verified that the convergence definition given above satisfies this condition.
If, on the other hand, we consider the set S consisting of the single point x = 0, our sequence gives P_n(S) = 0 for every n, while for the limiting distribution we have P(S) = 1, so that P_n(S) does certainly not tend to P(S). Accordingly, the distribution function ε(x − 1/n) tends to ε(x) in every continuity point of ε(x), i.e. for any x ≠ 0, but not in the discontinuity point x = 0.

6.8. A convergence theorem. A sequence of distribution functions F₁(x), F₂(x), ... is said to be convergent, if there is a non-decreasing function F(x) such that F_n(x) → F(x) in every continuity point of F(x). We then always have 0 ≤ F(x) ≤ 1, but the example F_n(x) = ε(x − n) considered in the preceding paragraph shows that F(x) is not necessarily a distribution function. Thus a sequence {F_n(x)} may be convergent without converging to a distribution function. We shall now prove the following proposition that will be required in the sequel:

Every sequence {F_n(x)} of distribution functions contains a convergent sub-sequence. The limit F(x) can always be determined so as to be everywhere continuous to the right.

Let r₁, r₂, ... be the enumerable (cf 2.2) set of all positive and negative rational numbers, including zero, and consider the sequence F₁(r₁), F₂(r₁), .... This is a bounded infinite sequence of real numbers, which by the Bolzano-Weierstrass theorem (2.2) has at least one limiting point. The sequence of numbers {F_n(r₁)} thus always contains a convergent sub-sequence. The same thing may also be expressed by saying that the sequence of functions {F_n(x)} always contains a sub-sequence Z₁ convergent for the particular value x = r₁. By the same argument, Z₁ contains a sub-sequence Z₂ convergent for x = r₁ and for x = r₂. Repeating the same procedure, we obtain successively the sub-sequences Z₁, Z₂, ..., where Z_n is a sub-sequence of Z_{n−1}, and Z_n converges for the particular values x = r₁, r₂, ..., r_n.
Forming finally the »diagonal» sequence Z consisting of the first member of $Z_1$, the second member of $Z_2$, . . ., it is readily seen that Z converges for every rational value of x. Let the members of Z be $F_{n_1}(x), F_{n_2}(x), \ldots$, and put

$\lim_{\nu \to \infty} F_{n_\nu}(r_i) = c_i \qquad (i = 1, 2, \ldots).$

Then $\{c_i\}$ is a bounded sequence, and since every $F_{n_\nu}$ is a non-decreasing function, it follows that we have $c_i \le c_k$ as soon as $r_i < r_k$. Now we define a function F(x) by writing

$F(x) = $ lower bound of $c_i$ for all $r_i > x$.

It then follows directly from the definition that F(x) is a bounded non-decreasing function of x. It is also easily proved that F(x) is everywhere continuous to the right. We shall now show that in every continuity point of F(x) we have

(6.8.1)   $\lim_{\nu \to \infty} F_{n_\nu}(x) = F(x),$

so that the sub-sequence Z is convergent. If x is a continuity point of F(x) we can, in fact, choose h > 0 such that the difference $F(x+h) - F(x-h)$ is smaller than any given ε > 0. Let $r_i$ and $r_k$ be rational points situated in the open intervals $(x-h, x)$ and $(x, x+h)$ respectively, so that

(6.8.2)   $F(x-h) \le c_i \le F(x) \le c_k \le F(x+h).$

Further, for every ν we have

(6.8.3)   $F_{n_\nu}(r_i) \le F_{n_\nu}(x) \le F_{n_\nu}(r_k).$

As ν tends to infinity, $F_{n_\nu}(r_i)$ and $F_{n_\nu}(r_k)$ tend to the limits $c_i$ and $c_k$ respectively. The difference between these limits is, according to (6.8.2), smaller than ε, and the quantity F(x) is included between $c_i$ and $c_k$. Since ε is arbitrary, it follows that $F_{n_\nu}(x)$ tends to F(x). Thus the sub-sequence Z is convergent, and our theorem is proved.

CHAPTER 7.

The Lebesgue-Stieltjes Integral for Functions of One Variable.

7.1. The integral of a bounded function over a set of finite P-measure. — In the preceding chapter, we have seen that the theory of Lebesgue measure given in Ch. 4 may be generalized by the introduction of the concept of a general non-negative and additive P-measure.
We now proceed to show that an exactly analogous generalization may be applied to the theory of the Lebesgue integral developed in Ch. 5.

Let us assume that a fixed P-measure is given. This measure may be defined by a non-negative and additive set function P(S), or by the corresponding non-decreasing point function F(x). We have seen in the preceding chapter that these two functions are perfectly equivalent for the purpose of defining the P-measure.

Let further g(x) be a given function of x, defined and bounded for all x belonging to a given set S of finite P-measure. In the same way as in 5.1, we divide S into an arbitrary finite number of parts $S_1, \ldots, S_n$, no two of which have a common point. In the basic definition (5.1.1) of the Darboux sums, we now replace L-measure by P-measure, and so obtain the generalized Darboux sums

(7.1.1)   $z = \sum_{\nu=1}^{n} m_\nu P(S_\nu), \qquad Z = \sum_{\nu=1}^{n} M_\nu P(S_\nu),$

where, as in the previous case, $m_\nu$ and $M_\nu$ denote the lower and upper bounds of g(x) in $S_\nu$. The further development is exactly analogous to 5.1. The upper bound of the set of all possible z-values is called the lower integral of g(x) over S with respect to the given P-measure, while the lower bound of the set of all possible Z-values is the corresponding upper integral. As in 5.1 it is shown that the lower integral is at most equal to the upper integral.

If the lower and upper integrals are equal, g(x) is said to be integrable over S with respect to the given P-measure, and the common value of the two integrals is called the Lebesgue-Stieltjes integral of g(x) over S with respect to the given P-measure, and is denoted by either of the two expressions

$\int_S g(x)\,dP(S) = \int_S g(x)\,dF(x).$

When there is no risk of a misunderstanding, we shall write simply dP and dF instead of dP(S) and dF(x). Instead of integral or integrable with respect to the given P-measure, we shall usually say with respect to P(S), or with respect to F(x), according as we consider the P-measure to be defined by P(S) or by F(x).
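The generalized Darboux sums (7.1.1) lend themselves to direct computation. The sketch below is a numeric illustration of our own, with the arbitrary choices F(x) = x² and g(x) = x on (0, 1); both sums close in on the common value $\int_0^1 x\,d(x^2) = \int_0^1 2x^2\,dx = 2/3$.

```python
# Sketch of the generalized Darboux sums (7.1.1) over S = (0, 1), with
# the parts S_v taken as equal sub-intervals and P(S_v) = F(b_v) - F(a_v).
# F(x) = x^2 and g(x) = x are illustrative choices.

def darboux_sums(g, F, a, b, n):
    z = Z = 0.0
    for v in range(n):
        lo = a + (b - a) * v / n
        hi = a + (b - a) * (v + 1) / n
        mass = F(hi) - F(lo)                  # P-measure of the part S_v
        vals = [g(lo + (hi - lo) * k / 50) for k in range(51)]
        z += min(vals) * mass                 # lower sum term m_v P(S_v)
        Z += max(vals) * mass                 # upper sum term M_v P(S_v)
    return z, Z

F = lambda x: x * x
z, Z = darboux_sums(lambda x: x, F, 0.0, 1.0, 2000)
assert z <= 2.0 / 3.0 <= Z and Z - z < 1e-3   # common limit 2/3
```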
As long as we are dealing with functions of a single variable, we shall as a rule prefer to use F(x). In the particular case when F(x) = x, we have P(S) = L(S), and it is evident that the above definition of the Lebesgue-Stieltjes integral reduces to the definition of the Lebesgue integral given in 5.1. Thus the Lebesgue-Stieltjes integral is obtained from the Lebesgue integral simply by replacing, in the definition of the integral, the Lebesgue measure by the more general P-measure.

All properties of the Lebesgue integral deduced in 5.2 and 5.3 are now easily generalized to the Lebesgue-Stieltjes integral, no other modification of the proofs being required than the substitution of P-measure for L-measure. Thus we find that, if g(x) is bounded and B-measurable in a set S of finite P-measure, then g(x) is integrable over S with respect to P(S). For bounded functions and sets of finite P-measure, we further obtain the following generalizations of relations deduced in 5.3:

(7.1.2)   $\int_S (g_1(x) + g_2(x))\,dF = \int_S g_1(x)\,dF + \int_S g_2(x)\,dF,$

(7.1.3)   $\int_S c\,g(x)\,dF = c \int_S g(x)\,dF,$

(7.1.4)   $m\,P(S) \le \int_S g(x)\,dF \le M\,P(S),$

(7.1.5)   $\int_{S_1 + S_2} g(x)\,dF = \int_{S_1} g(x)\,dF + \int_{S_2} g(x)\,dF,$

(7.1.6)   $\left| \int_S g(x)\,dF \right| \le \int_S |g(x)|\,dF,$

where c is a constant, m and M denote the lower and upper bounds of g(x) in S, while $S_1$ and $S_2$ are two sets without common points.

It follows from (7.1.4) that the integral of a bounded function over a set of P-measure zero is equal to zero. Thus the value of an integral is not affected if the values of the function g(x) are arbitrarily changed over a set of P-measure zero.

We also have the following proposition generalizing (5.3.6): If the sequence $\{g_\nu(x)\}$ is uniformly bounded in S, and if $\lim_{\nu \to \infty} g_\nu(x) = g(x)$ exists almost everywhere (P) in S, then

(7.1.7)   $\lim_{\nu \to \infty} \int_S g_\nu(x)\,dF = \int_S g(x)\,dF.$

The analogous generalizations of (5.3.7) and (5.3.8) are obtained in the same way as in 5.3.
If $c_1$ and $c_2$ are non-negative constants, we easily deduce the following relation, which has no analogue for the Lebesgue integral:

$\int_S g(x)\,d(c_1 F_1 + c_2 F_2) = c_1 \int_S g(x)\,dF_1 + c_2 \int_S g(x)\,dF_2.$

In the particular case when the set S consists of a single point $x = x_0$, we obtain directly from the definition

$\int_S g(x)\,dF = g(x_0)\,P(x = x_0).$

Consider now the case when F(x) is a step-function (cf 6.2) with steps of the height $p_\nu$ in the points $x = x_\nu$, and denote the set of all points $x_\nu$ by X. Using the fact that the integral over a set of P-measure zero is equal to zero, and the generalization of (5.3.8) mentioned above, we then obtain

(7.1.8)   $\int_S g(x)\,dF = \int_{SX} g(x)\,dF = \sum_\nu \int_{x = x_\nu} g(x)\,dF = \sum_\nu p_\nu\,g(x_\nu),$

the sum being extended over the points $x_\nu$ belonging to S. In the further particular case when g(x) = 1, we have

$\int_S dF = \int_S dP = P(S).$

We shall often have to consider integrals where the function g(x) is complex-valued, say $g(x) = a(x) + i\,b(x)$, where a(x) and b(x) are real and bounded in S. We then define the integral by writing

$\int_S g(x)\,dF = \int_S a(x)\,dF + i \int_S b(x)\,dF.$

All properties deduced above extend easily to integrals of this type. For the relation (7.1.6), this extension is a little less obvious than in the other cases, and will be shown here. Put

$\int_S g(x)\,dF = r e^{i\vartheta},$

where r and ϑ are real, and r ≥ 0. The real part of the quantity $|g(x)| - e^{-i\vartheta} g(x)$ is always ≥ 0. Consequently the real integral

$\int_S \left( |g(x)| - e^{-i\vartheta} g(x) \right) dF = \int_S |g(x)|\,dF - r = \int_S |g(x)|\,dF - \left| \int_S g(x)\,dF \right|$

is ≥ 0, and this is equivalent to (7.1.6).

7.2. Unbounded functions and sets of infinite P-measure. — The extensions of the Lebesgue integral treated in 5.4 and 5.5 may be applied in a perfectly analogous way to the Lebesgue-Stieltjes integral. In fact, every word and every formula of 5.4 and 5.5 hold good, if Lebesgue measure is throughout replaced by P-measure, and Lebesgue integrals are replaced by Lebesgue-Stieltjes integrals with respect to P(S) or F(x).
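The step-function case (7.1.8) above reduces the integral to a finite weighted sum, which is easy to verify in code. The jump points and weights below are illustrative choices of ours.

```python
# Sketch of (7.1.8): when F is a step-function with steps p_v at the
# points x_v, the Lebesgue-Stieltjes integral over R1 collapses to a
# weighted sum over the jump points.

points  = [-1.0, 0.0, 2.0]   # jump points x_v
weights = [0.2, 0.5, 0.3]    # step heights p_v (total mass 1)

def integral_dF(g):
    """Integral of g with respect to the step-function F over R1."""
    return sum(p * g(x) for p, x in zip(weights, points))

assert integral_dF(lambda x: 1.0) == 1.0             # total mass P(R1) = 1
assert abs(integral_dF(lambda x: x) - 0.4) < 1e-12   # 0.2*(-1) + 0.5*0 + 0.3*2
```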
Thus g(x) is called integrable with respect to P(S) — or F(x) — over a set S of finite P-measure, if the limit (cf 5.4.1)

$\lim_{a \to -\infty,\, b \to +\infty} \int_S g_{a,b}(x)\,dP = \int_S g(x)\,dP = \int_S g(x)\,dF$

exists and has a finite value. If this is the case, |g(x)| is also integrable with respect to P over S.

Further, when S is of infinite P-measure¹), g(x) is called integrable with respect to P — or F — over S, if (cf 5.5) g(x) is integrable with respect to P — or F — over $S_{a,b}$ for all a and b, and if the limit

$\lim_{a \to -\infty,\, b \to +\infty} \int_{S_{a,b}} |g(x)|\,dP = \int_S |g(x)|\,dP = \int_S |g(x)|\,dF$

exists and has a finite value. If this is the case, the limit (cf 5.5.1)

(7.2.1)   $\lim_{a \to -\infty,\, b \to +\infty} \int_{S_{a,b}} g(x)\,dP = \int_S g(x)\,dP = \int_S g(x)\,dF$

also exists and is finite, and we shall accordingly say that the Lebesgue-Stieltjes integral of g(x) with respect to P — or F — over the set S is convergent²). The limit (7.2.1) is then, by definition, the value of this integral. — If |g(x)| < G(x), where G(x) is integrable, then g(x) is itself integrable.

The properties (7.1.2)—(7.1.6) of the Lebesgue-Stieltjes integral hold true for any functions integrable with respect to the given P-measure. In the case of a set S of infinite P-measure the relation (7.1.4) should, however, be replaced by

$\int_S g(x)\,dF \ge 0 \quad$ if $g(x) \ge 0$ for all x in S.

We finally have the following generalization of the proposition expressed by (7.1.7): If $\lim_{\nu \to \infty} g_\nu(x) = g(x)$ exists almost everywhere (P) in the set S of finite or infinite P-measure, and if $|g_\nu(x)| < G(x)$ for all ν and for all x in S, where G(x) is integrable with respect to F over S, then g(x) is integrable with respect to F over S, and

(7.2.2)   $\lim_{\nu \to \infty} \int_S g_\nu(x)\,dF = \int_S g(x)\,dF.$

The generalization of the above considerations to the case of integrals with a complex-valued function g(x) is obvious.

¹) In the case of a bounded P(S) (e.g. when P(S) is a probability function, cf 6.6) there are, of course, no sets of infinite P-measure.
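The dominated convergence proposition (7.2.2) can be illustrated numerically. In the sketch below the weight $dF(x) = e^{-x}\,dx$ on (0, ∞) and the sequence $g_\nu(x) = x/(1 + x/\nu)$ are illustrative choices of ours: $g_\nu(x) \to x$ pointwise, with the dominating function G(x) = x integrable against dF, so the integrals rise towards $\int_0^\infty x\,e^{-x}\,dx = 1$.

```python
import math

# Sketch of (7.2.2): g_v(x) = x / (1 + x/v) -> x pointwise, dominated by
# G(x) = x, which is integrable against dF(x) = e^{-x} dx (illustrative
# choices).  The integrals of g_v should increase toward the limit 1.

def integral(g, n=20000, upper=30.0):
    """Crude left-endpoint Riemann approximation of the integral of
    g(x) e^{-x} over (0, upper); the tail beyond upper is negligible."""
    h = upper / n
    return sum(g(k * h) * math.exp(-k * h) * h for k in range(n))

vals = [integral(lambda x, v=v: x / (1.0 + x / v)) for v in (1, 10, 100, 10000)]
limit = integral(lambda x: x)

assert vals == sorted(vals)            # monotone approach to the limit
assert abs(vals[-1] - limit) < 1e-3    # large v already close to the limit
assert abs(limit - 1.0) < 5e-3         # exact value of the limiting integral
```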
In the particular case when F(x) = x all our theorems reduce, of course, to the corresponding theorems on ordinary Lebesgue integrals.

²) With respect to the terminology, the same remark should be made here as in the case of (6.6.1).

7.3. Lebesgue-Stieltjes integrals with a parameter. — We shall often be concerned with integrals of the type

$u(t) = \int_S g(x, t)\,dF(x),$

where t is a parameter, while S is a given set of finite or infinite P-measure. We shall require certain theorems concerning continuity, differentiation and integration of such integrals with respect to t. In the particular case when F(x) = x, these theorems reduce to theorems on Lebesgue integrals.

We assume that g(x, t) is complex-valued and that, for every fixed t that will be considered, the real and imaginary parts are B-measurable functions of x which are integrable over S with respect to F(x). By $G_1(x), G_2(x), \ldots$, we denote functions which are integrable over S with respect to F(x).

I) Continuity. — If, for almost all (P) values of x in S, the function g(x, t) is continuous with respect to t in the point $t = t_0$, and if for all t in some neighbourhood of $t_0$ we have $|g(x, t)| < G_1(x)$, then u(t) is continuous for $t = t_0$, so that we have¹)

(7.3.1)   $\lim_{t \to t_0} \int_S g(x, t)\,dF(x) = \int_S g(x, t_0)\,dF(x).$

This is a direct corollary of (7.2.2). For any sequence of values $t_1, t_2, \ldots$ belonging to the given neighbourhood and tending to $t_0$, the conditions of (7.2.2) are, in fact, satisfied if we take $g_\nu(x) = g(x, t_\nu)$ and $g(x) = g(x, t_0)$. Thus by (7.2.2) we have $u(t_\nu) \to u(t_0)$, and it follows that the same relation holds when t tends continuously to $t_0$. — When the conditions of I) are satisfied for all $t_0$ in the open interval (a, b), it is seen that u(t) is continuous in the whole interval.

II) Differentiation.
— If, for almost all (P) values of x in S and for a fixed value of t, the following conditions are satisfied:

1) The partial derivative $\frac{\partial g(x, t)}{\partial t}$ exists,

2) We have $\left| \frac{g(x, t+h) - g(x, t)}{h} \right| < G_2(x)$ for $0 < |h| < h_0$, where $h_0$ is independent of x,

then

(7.3.2)   $u'(t) = \frac{d}{dt} \int_S g(x, t)\,dF(x) = \int_S \frac{\partial g(x, t)}{\partial t}\,dF(x).$

Like the preceding proposition, this is a direct corollary of (7.2.2). For any sequence $h_1, h_2, \ldots$, where $|h_\nu| < h_0$ and $h_\nu$ tends to zero, the conditions of (7.2.2) are satisfied if we take

$g_\nu(x) = \frac{g(x, t + h_\nu) - g(x, t)}{h_\nu} \quad$ and $\quad g(x) = \frac{\partial g(x, t)}{\partial t}.$

Thus

$\frac{u(t + h_\nu) - u(t)}{h_\nu} = \int_S \frac{g(x, t + h_\nu) - g(x, t)}{h_\nu}\,dF(x) \to \int_S \frac{\partial g(x, t)}{\partial t}\,dF(x),$

so that the derivative u′(t) exists and has the value given by (7.3.2).

We remark that, if the partial derivative $\frac{\partial g}{\partial t}$ exists and satisfies the condition $\left| \frac{\partial g}{\partial t} \right| < G_2(x)$ for all t in an interval (a, b), it follows from the relation

$g(x, t+h) - g(x, t) = h \left( \frac{\partial g}{\partial t} \right)_{t + \theta h}, \qquad (0 < \theta < 1),$

that (7.3.2) holds for all t in (a, b).

Note that the condition 2) of II) is not satisfied e.g. if we take F(x) = x, S = (−∞, +∞), and

$g(x, t) = e^{-(x-t)}$ for $x \ge t$, $\qquad g(x, t) = 0$ for $x < t$.

In this case we have

$u(t) = \int_{-\infty}^{\infty} g(x, t)\,dx = \int_t^{\infty} e^{-(x-t)}\,dx = 1,$

so that u′(t) = 0, while the application of (7.3.2) would give $u'(t) = 1$, which is obviously false. The correct way of calculating u′(t) is here, of course, to take account of the variable lower limit of the integral, thus obtaining

$u'(t) = \int_t^{\infty} e^{-(x-t)}\,dx - 1 = 0.$

III) Integration. — If, for almost all (P) values of x in S, the function g(x, t) is continuous with respect to t in the finite open interval (a, b) and satisfies the condition $|g(x, t)| < G_3(x)$ for all t in (a, b), then

(7.3.3)   $\int_a^b u(t)\,dt = \int_a^b \left[ \int_S g(x, t)\,dF(x) \right] dt = \int_S \left[ \int_a^b g(x, t)\,dt \right] dF(x).$

Further, if the above conditions are satisfied for every finite interval (a, b) and if, in addition, we have $\int_{-\infty}^{\infty} |g(x, t)|\,dt < G_4(x)$, then²)

(7.3.4)   $\int_{-\infty}^{\infty} u(t)\,dt = \int_S \left[ \int_{-\infty}^{\infty} g(x, t)\,dt \right] dF(x).$

We consider first the case of a finite interval (a, b).

¹) The theorem holds, with the same proof, even if $t_0$ is replaced by +∞ or −∞.
For almost all (P) values of x in S, the integral

$h(x, t) = \int_a^t g(x, \tau)\,d\tau$

has, by (5.1.4), for all t in (a, b) the partial derivative $\frac{\partial h(x, t)}{\partial t} = g(x, t)$. Further $|h(x, t)| < (b - a)\,G_3(x)$, so that h(x, t) is integrable over S with respect to F(x). Writing

$v(t) = \int_S h(x, t)\,dF(x),$

we may now apply the remark to theorem II), and find

$v'(t) = \int_S g(x, t)\,dF(x) = u(t).$

By I), the function u(t) is continuous in (a, b), so that the difference

$\Delta(t) = \int_a^t u(\tau)\,d\tau - v(t)$

has a derivative $\Delta'(t) = u(t) - v'(t) = 0$. For t = a, we have h(x, a) = 0, v(a) = 0, and thus Δ(a) = 0. It follows that Δ(t) = 0 for $a \le t \le b$, and thus in particular Δ(b) = 0, which is identical with (7.3.3).

When the conditions of the second part of the theorem are satisfied, (7.3.3) holds for any finite (a, b), and we have

$\int_a^b |u(t)|\,dt \le \int_a^b \left[ \int_S |g(x, t)|\,dF(x) \right] dt = \int_S \left[ \int_a^b |g(x, t)|\,dt \right] dF(x) \le \int_S G_4(x)\,dF(x).$

Thus the integral $\int_{-\infty}^{\infty} |u(t)|\,dt$ is convergent. If, in the relation (7.3.3), we allow a and b to tend to −∞ and +∞ respectively, it follows that the first member tends to the first member of (7.3.4). An application of (7.2.2) shows that, at the same time, the second member of (7.3.3) tends to the second member of (7.3.4). Thus (7.3.4) is proved.

The theorems proved in this paragraph show that, subject to certain conditions, analytical operations such as limit passages, differentiations and integrations with respect to a parameter may be performed under a sign of integration.

²) It is evident how the conditions should be modified when we want to integrate u(t) over (a, ∞) or (−∞, b).

7.4. Lebesgue-Stieltjes integrals with respect to a distribution.
— If P(S) is the probability function of a distribution (cf 6.6), the integral

(7.4.1)   $\int_{R_1} g(x)\,dP = \int_{-\infty}^{\infty} g(x)\,dP = \int_{-\infty}^{\infty} g(x)\,dF$

may be concretely, though somewhat vaguely, interpreted as a weighted mean of the values of g(x) for all values of x, the weights being furnished by the mass quantities dP or dF situated in the neighbourhood of each point x. The sum of all weights is unity, since we have

$\int_{-\infty}^{\infty} dP = \int_{-\infty}^{\infty} dF = P(R_1) = 1.$

Every bounded and B-measurable g(x) is integrable with respect to P (or F) over (−∞, ∞).

If the mass distribution is represented as the sum of two components according to (6.6.2), the integral (7.4.1) becomes

$\int_{-\infty}^{\infty} g(x)\,dF = c_1 \int_{-\infty}^{\infty} g(x)\,dF_1 + c_2 \int_{-\infty}^{\infty} g(x)\,dF_2,$

where the first term of the second member reduces to a sum over the discrete mass points of the distribution, as shown in (7.1.8).

If, for a positive integer ν, the function $x^\nu$ is integrable with respect to F(x) over (−∞, ∞), the integral

$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu\,dF(x)$

is called the moment of order ν, or simply the ν:th moment, of the distribution, and we say that the ν:th moment exists. It is then easily seen that any moment of order μ < ν also exists.

It is known from elementary mechanics that the first order moment is the abscissa of the centre of gravity of the mass in the distribution, while the second order moment represents the moment of inertia of the mass with respect to a perpendicular axis through the point x = 0. — The moments of a distribution will play an important part in the applications made later in this book.

If, for some k > 0, the distribution function F(x) satisfies the conditions (with respect to the notations, cf 12.1)

$F(x) = O(|x|^{-k})$ when $x \to -\infty$, $\qquad 1 - F(x) = O(x^{-k})$ when $x \to +\infty$,

then any moment of order ν < k exists. In order to prove this, it is according to 7.2 sufficient to show that the integral of $|x|^\nu$ with respect to F(x) over an interval (a, b) is less than a constant independent of a and b.
Now we have by hypothesis

$\int_{2^{r-1}}^{2^r} |x|^\nu\,dF(x) \le 2^{r\nu} \left( F(2^r) - F(2^{r-1}) \right) \le 2^{r\nu} \left( 1 - F(2^{r-1}) \right) < C\,2^{r(\nu - k)},$

where C is independent of r, and a similar relation for the integral over $(-2^r, -2^{r-1})$. Summing over r = 1, 2, . . . and adding the integral over (−1, 1), which is ≤ 1, we find that the integral over any interval (a, b) is bounded by a constant independent of a and b, and thus the ν:th moment exists.

7.5. The Riemann-Stieltjes integral. — Consider the Lebesgue-Stieltjes integral

(7.5.1)   $\int_I g(x)\,dF(x)$

in the particular case when I is a finite half-open interval $a < x \le b$, while g(x) is continuous in I and tends to a finite limit as $x \to a + 0$. We divide I into n sub-intervals $i_\nu = (x_{\nu-1} < x \le x_\nu)$ by means of the points

$a = x_0 < x_1 < \cdots < x_n = b,$

and consider the Darboux sums (7.1.1) which correspond to the division $I = i_1 + \cdots + i_n$. We then obtain

(7.5.2)   $z = \sum_{\nu=1}^{n} m_\nu \left( F(x_\nu) - F(x_{\nu-1}) \right), \qquad Z = \sum_{\nu=1}^{n} M_\nu \left( F(x_\nu) - F(x_{\nu-1}) \right),$

$m_\nu$ and $M_\nu$ being the lower and upper bounds of g(x) in $i_\nu$. Now let ε > 0 be given. By hypothesis we can then find δ such that $M_\nu - m_\nu < \varepsilon$ as soon as $x_\nu - x_{\nu-1} < \delta$. Choosing n and the $x_\nu$ such that $x_\nu - x_{\nu-1} < \delta$ for all ν, we then have

$Z - z < \varepsilon \left( F(b) - F(a) \right).$

Thus when n tends to infinity, and at the same time the maximum length of the sub-intervals $i_\nu$ tends to zero, z and Z tend to a common limit which must be equal to the integral (7.5.1):

(7.5.3)   $\lim z = \lim Z = \int_a^b g(x)\,dF(x).$

Thus in the particular case here considered the simple expression (7.5.2) of the Darboux sums is sufficient to determine the value of the Lebesgue-Stieltjes integral. If we put F(x) = x, these expressions become identical with the Darboux sums considered in the theory of the ordinary Riemann integral. Accordingly, the integral defined by (7.5.3) is called a Riemann-Stieltjes integral. It follows from the above that, when this integral exists, it always has the same value as the corresponding Lebesgue-Stieltjes integral.
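The limit (7.5.3) lends itself to direct numerical verification. The sketch below is an illustration of our own with the arbitrary choice F(x) = x² on (0, 1), where the Riemann-Stieltjes sum should approach $\int_0^1 g(x)\,dF(x) = \int_0^1 g(x) \cdot 2x\,dx$.

```python
# Sketch of a Riemann-Stieltjes sum approximating (7.5.3):
#   sum_v g(xi_v) [F(x_v) - F(x_{v-1})]  ->  integral of g dF
# as the maximum sub-interval length tends to zero.  F(x) = x^2 is an
# illustrative choice, so the integral equals  int_0^1 g(x) * 2x dx.

def rs_sum(g, F, a, b, n):
    s = 0.0
    for v in range(1, n + 1):
        x_prev = a + (b - a) * (v - 1) / n
        x_cur  = a + (b - a) * v / n
        xi = 0.5 * (x_prev + x_cur)        # an arbitrary point of i_v
        s += g(xi) * (F(x_cur) - F(x_prev))
    return s

F = lambda x: x * x
# for g(x) = x the exact value is  int_0^1 2 x^2 dx = 2/3
assert abs(rs_sum(lambda x: x, F, 0.0, 1.0, 4000) - 2.0 / 3.0) < 1e-4
```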
If, in every sub-interval $i_\nu$, we take an arbitrary point $\xi_\nu$, we obviously have

(7.5.4)   $\lim_{n \to \infty} \sum_{\nu=1}^{n} g(\xi_\nu) \left[ F(x_\nu) - F(x_{\nu-1}) \right] = \int_a^b g(x)\,dF(x),$

since the sum in the first member is included between z and Z.

The Riemann-Stieltjes integral (7.5.3) exists even in the more general case when g(x) is bounded in (a, b) and has at most a finite number of discontinuity points ξ, provided that F(x) is continuous in every ξ. We can, in fact, then surround each ξ by a sub-interval which gives an arbitrarily small contribution to the sums z and Z. In the particular case when F(x) is continuous everywhere in (a, b) and has a continuous derivative F′(x), except at most in a finite number of points, we have for every $i_\nu$ not containing any of the exceptional points $F(x_\nu) - F(x_{\nu-1}) = (x_\nu - x_{\nu-1})\,F'(\xi_\nu)$, where $\xi_\nu$ is a point belonging to $i_\nu$. By means of (7.5.4) it follows that in this case the integral (7.5.3) reduces to an ordinary Riemann integral:

(7.5.5)   $\int_a^b g(x)\,dF(x) = \int_a^b g(x)\,F'(x)\,dx.$

All these properties immediately extend to the case of a complex-valued function g(x), and also to infinite intervals (a, b), subject to the condition that g(x) is integrable over (a, b) with respect to F(x). If this condition is satisfied, we have e.g. the following generalization of (7.5.4):

(7.5.6)   $\lim_{n \to \infty} \sum_{\nu=1}^{n} g(\xi_\nu) \left[ F(x_\nu) - F(x_{\nu-1}) \right] = \int_{-\infty}^{\infty} g(x)\,dF(x),$

where as before the maximum length of the sub-intervals $(x_{\nu-1}, x_\nu)$ tends to zero as $n \to \infty$, while at the same time $x_0 \to -\infty$ and $x_n \to +\infty$.

Suppose now that two non-decreasing functions F(x) and G(x) are given, which are both continuous in the closed interval (a, b), except at most for a finite number of discontinuity points, which are all inner points of (a, b). We further suppose that no point in (a, b) is a discontinuity point for both functions F and G. Choosing the sub-intervals so that no $x_\nu$ is a discontinuity point, we then have

$F(b) G(b) - F(a) G(a) = \sum_{\nu=1}^{n} \left[ F(x_\nu) G(x_\nu) - F(x_{\nu-1}) G(x_{\nu-1}) \right]$

$= \sum_{\nu=1}^{n} F(x_\nu) \left[ G(x_\nu) - G(x_{\nu-1}) \right] + \sum_{\nu=1}^{n} G(x_{\nu-1}) \left[ F(x_\nu) - F(x_{\nu-1}) \right].$
The two terms in the last expression are included between the lower and upper Darboux sums corresponding to the integrals $\int_a^b F\,dG$ and $\int_a^b G\,dF$ respectively. Passing to the limit, we thus obtain the formula of partial integration:

(7.5.7)   $\int_a^b d(FG) = \int_a^b F\,dG + \int_a^b G\,dF.$

Finally, we consider a sequence of distribution functions (cf 6.7) $F_1(x), F_2(x), \ldots$, which converge to a non-decreasing function F(x) in every continuity point of the latter. (By 6.7, the limit F(x) is not necessarily a distribution function.) Let g(x) be everywhere continuous. For any finite interval (a, b) such that a and b are continuity points of F(x), an inspection of the Darboux sums that determine the integrals then shows that we have

(7.5.8)   $\lim_{n \to \infty} \int_a^b g(x)\,dF_n(x) = \int_a^b g(x)\,dF(x).$

Suppose further that, to any ε > 0, we can find A such that

$\int_{-\infty}^{-A} |g(x)|\,dF_n(x) + \int_A^{\infty} |g(x)|\,dF_n(x) < \varepsilon$

for n = 1, 2, . . .. We may then always choose A such that F(x) is continuous for x = ± A, and by means of (7.5.8) we find that

$\int_A^B |g(x)|\,dF_n(x) \to \int_A^B |g(x)|\,dF(x),$

where B > A is another continuity point of F(x). Thus the last integral is ≤ ε for any B > A, and for the integral over (−B, −A) there is a corresponding relation. It follows that g(x) is integrable over (−∞, ∞) with respect to F(x). If, in (7.5.8), we take a = −A and b = +A, each integral will differ by at most 2ε from the corresponding integral over (−∞, ∞). Since ε is arbitrary, we then have

(7.5.9)   $\lim_{n \to \infty} \int_{-\infty}^{\infty} g(x)\,dF_n(x) = \int_{-\infty}^{\infty} g(x)\,dF(x).$

This relation is immediately extended to complex-valued functions g(x).

References to chapters 4—7. — The classical theory of integration received its final form in a famous paper by Riemann (1854). About 1900, the theory of the measure of sets of points was founded by Borel and Lebesgue, and the latter introduced the concept of integral which bears his name.
The integral with respect to a non-decreasing function F(x) had been considered already in 1894 by Stieltjes, and in 1913 Radon (Ref. 205) investigated the general properties of additive set functions, and the theory of integration with respect to such functions.

There are a great number of treatises on modern integration theory. The reader is particularly referred to the books of Lebesgue himself (Ref. 23), de la Vallée Poussin (Ref. 40) and Saks (Ref. 23). De la Vallée Poussin gives an excellent introduction to the theory of the Lebesgue integral, and contains also some chapters on additive set functions, while the two other books go deeper into the more difficult parts of the theory.

Chapters 8—9. Theory of Measure and Integration in Rn.

CHAPTER 8.

Lebesgue Measure and other Additive Set Functions in Rn.

8.1. Lebesgue measure in Rn. — The elementary measure of the extension of a one-dimensional interval is the length of the interval. The corresponding measure for a two-dimensional interval (cf 3.1) is the area, and for a three-dimensional interval the volume of the interval. Generally, if i denotes the finite n-dimensional interval defined by the inequalities

$a_\nu \le x_\nu \le b_\nu \qquad (\nu = 1, 2, \ldots, n),$

we shall define the n-dimensional volume of the interval i as the non-negative quantity

$L(i) = \prod_{\nu=1}^{n} (b_\nu - a_\nu).$

For an open or half-open interval with the same extremes $a_\nu$ and $b_\nu$, the volume will be the same as in the case of the closed interval. A degenerate interval has always the volume zero. For an infinite non-degenerate interval, we put $L(i) = +\infty$.

The Borel lemma (cf 4.1) is directly extended to n dimensions, and by an easy generalization of the proof of (4.1.1) we find that L(i) is an additive function of the interval.
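The volume formula and its additivity are easy to check in code; the sketch below is an illustrative three-dimensional case of our own.

```python
# Sketch of the n-dimensional volume L(i) = prod (b_v - a_v) of a finite
# interval, and of its additivity when the interval is split in two
# along one coordinate.

def volume(a, b):
    """Volume of the interval a_v <= x_v <= b_v, with a, b coordinate tuples."""
    v = 1.0
    for lo, hi in zip(a, b):
        v *= hi - lo
    return v

a, b = (0.0, 0.0, 0.0), (2.0, 3.0, 4.0)
assert volume(a, b) == 24.0

# split along the first coordinate at x_1 = 1: the two volumes add up
left  = volume((0.0, 0.0, 0.0), (1.0, 3.0, 4.0))
right = volume((1.0, 0.0, 0.0), (2.0, 3.0, 4.0))
assert left + right == volume(a, b)
```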
defined for all Borel sets S in Rn, and taking the value L(i) as soon as S is an interval i.

In 4.3—4.7, we have given a detailed treatment of this problem in the case n = 1, and we have seen that there is a unique solution, viz. the Lebesgue measure in R1. The case of a general n requires no modification whatever. Every word and every formula of 4.3—4.7 hold true, if linear sets are throughout replaced by n-dimensional ones, and the length of a linear interval is replaced by the n-dimensional volume. It thus follows that there is a non-negative and additive set function L(S), uniquely defined for all Borel sets S in Rn, and such that, in the particular case when S is an interval, L(S) is equal to the n-dimensional volume of the interval. L(S) is called the n-dimensional Lebesgue measure¹) of S.

8.2. Non-negative additive set functions in Rn. — In the same way as in the one-dimensional case, we may also for n > 1 consider non-negative and additive set functions P(S) of a more general kind than the n-dimensional Lebesgue measure L(S). We shall consider set functions P(S) defined for all Borel sets S in Rn and satisfying the conditions A)—C) of 6.2. It is immediately seen that these conditions do not contain any reference to the number of dimensions. The relations (6.2.1)—(6.2.3) then obviously hold for any number of dimensions.

With any set function P(S) of this type we may associate a point function $F(x) = F(x_1, \ldots, x_n)$, in a similar way as shown by (6.2.4) for the one-dimensional case. The direct generalization of (6.2.4) is, however, somewhat cumbersome for a general n, and we shall content ourselves with developing the formulae for the particular case of a bounded P(S), where the definition of the associated point function may be simplified in the way shown for the one-dimensional case by (6.5.1). This will be done in the following paragraph.

As in the case n = 1, any non-negative and additive set function P(S) in Rn defines an n-dimensional P-measure of the set S, which constitutes a generalization of the n-dimensional Lebesgue measure L(S). The remarks of 6.4 on sets of P-measure zero apply to sets in any number of dimensions.

¹) In order to be quite precise, we ought to adopt a notation showing explicitly the number of dimensions, e.g. by writing $L_n(S)$ instead of L(S). There should, however, be no risk of misunderstanding, if it is always borne in mind that the measure of a given point set is relative to the space in which it is considered. Thus if we consider e.g. the interval (0, 1) on a straight line as a set of points in R1, its (one-dimensional) measure has the value 1. If, on the other hand, we take the line as x-axis in a plane, and consider the same interval as a set of points in R2, we are concerned with a degenerate interval, the (two-dimensional) measure of which is equal to zero.
As in the case n J, any non-negative and additive set function P[S) in Rn defines an //-dimensional P-measure of the set which constitutes a generalization of the //-dimensional Lebesgue measure L (aS^). The remarks of 6.4 on sets of /^-measure zero apply to sets in any number of dimensions. *) In order to be quite precise, we ought to adopt a notation showing explicitly the number of dimensions, e. g. by writing instead of L(S). There should, however, be no risk of misunderstanding, if it is alw^ays borne in mind that the measure of a given point set is relative to the spaee in which it is considered. Thus if we consider e. g. the interval 0, 1) on a .straight line as a set of points in JRi. its (one-dimensional) measure has the value 1. If, on the other hand^ we take the line as j- axis in a plane, and consider the same interval as a set of points in JI 2 , "e are concerned with a degenerate interval, the 'tw’o-dimensional) measure of which is equal to zero. 8.3 8.3. Bounded set functions. — When P(Rn) is 6nite, we shall say (cf 6.5) that P(S) is hounded. We have then always P(S)^P(Rn)- For a bounded P{S) we define, in gfeneralization of (6.5.1): (8.3.1) F{x) = P(x„ . . Xn) = P(5i ^ . . ., ^ Xn), Evidently F[x) is, in each variable Xv, a non-decreasing; function which is everywhere continuous to the right, and we have for all * (cf 6.5.2) 0 < F(x) ^ P(R„). In the one-dimensional case, the value of P(S) for a half-open interval defined by a < x^a h is, by (6.2.5), given by a first order difiEerence of F[xY P[i,) = JF[a) = F[a + A) ~ F[a), This formula may be generalized to the case of an arbitrary n, Con> sider first a set function P(<S') in R 2 ) ^ two-dimensional interval defined by ^ -f- Aj, a^< -¥ h^. We then have (8.3.2) P(i,)=^J,F{a,,a,) = F{ai + Ai, + h^) — F(a,, + h^) — F(ai -f Aj, a*) + F(ai, a*). This will be clear from Fig. 2. If ilfj, . . 
are the values assumed by P(iS) for each of the rectangular domains indicated in the figure, the additive property of P(5') gives = (M, + Jtf, + Jfs + itfj - (Jfif, + M,) - (M, + M^) + M,, and according to the definition (8.3.1) of F(x), this is identical with (8.3.2) . (tti + hi , a2 f- hi) Fig. 2. Set functions and point functions in Ag. 78 8.3 The generalization to an arbitrary n is immediate. If P(S) is a set function Rn, and if u is the half-open interval defined by tiv < ./‘r ^ -t- hr (oT v = 1 , 2, . . . , 7?, we have Pi^n) ^ = + Aj, . . tin + A«) (8.3.3) —F[a^,a^ Un + -f — F[a^ + Aj, . . ., Qn^i + Aw-i, an) + (— \YF{a^, . . ., ^/u). To any bounded P[S) in Rn, there thus corresponds a point function F[xy, . . ., x-ii) which, in each Xv, is non-decreasing and continuous to the right, and is such that the n:th diflPerence ^nF as defined by (8.3.3) is always non-negative. — Conversely, a generalization of the argu- ment of 6.3 shows that any given F with these properties uniquely determines a set function P{S) satisfying the conditions A) — C) of 6.2, which for any interval in assumes the value given by (8.3.3). When one of the variables in F, say Xr, tends to — oo , while all the others remain fixed, it is shown as in 6.5 that F tends to zero. Similarly, when all the Xv tend simultaneously to -\r ^ , F tends to P{R,.). ■ When all the variables in P' except one, say Xr, tend to -h oc , F' will tend to a limit, which is a bounded non-decreasing function P\(xr) of the remaining variable Xv. By 6.2, the function Fv(xr) has at most an enumerable number of discontinuity points /v, si, . . • Let us consider these as excluded values for the variable Xr, which is thus only allowed to assume values different from zl, zl, ... In the same way, each variable . . ., Xn has its own finite or enumerable set of excluded values. — For any non-excluded point x {x^, . . x„), the function F' is continuous. 
This follows from the inequality

$|F(x + k) - F(x)| \le F(x + |k|) - F(x - |k|) \le \sum_{\nu=1}^{n} \left( F_\nu(x_\nu + |h_\nu|) - F_\nu(x_\nu - |h_\nu|) \right),$

where $k = (h_1, \ldots, h_n)$ is an arbitrary point, while |k| denotes here the point $(|h_1|, \ldots, |h_n|)$, and the sums and differences x + k etc. are formed according to the rules of vector addition (cf 11.1—11.2). An inspection of Fig. 2 will help to make this inequality clear.

An n-dimensional interval such that none of the extremes $a_\nu$ and $b_\nu$ is an excluded value for the corresponding variable $x_\nu$ is called a continuity interval of P(S). The value assumed by P(S) when S is a continuity interval will obviously change in a continuous way for small variations in the $a_\nu$ and $b_\nu$. If two bounded set functions in Rn agree for all intervals that are continuity intervals for both, it follows (cf 6.7) that the set functions are identical.

8.4. Distributions. — Non-negative and additive set functions P(S) such that P(Rn) = 1 play, like the corresponding one-dimensional functions (cf 6.6), a fundamental part in the applications. By the preceding paragraph, the point function F(x) associated with a set function P(S) of this class satisfies the relations

(8.4.1)   $F(x) = F(x_1, \ldots, x_n) = P(\xi_1 \le x_1, \ldots, \xi_n \le x_n),$
$0 \le F(x) \le 1, \qquad \Delta_n F \ge 0,$
$F(-\infty, x_2, \ldots, x_n) = \cdots = F(x_1, \ldots, x_{n-1}, -\infty) = 0,$
$F(+\infty, \ldots, +\infty) = 1.$

As in the one-dimensional case, the functions P(S) and F(x) will be interpreted by means of a distribution of a unit of mass over the space Rn, such that every Borel set S carries the mass P(S). As in 6.6, we are at liberty to define the distribution either by the set function P(S) or by the corresponding point function F(x), which represents the quantity of mass allotted to the infinite interval $\xi_1 \le x_1, \ldots, \xi_n \le x_n$. The difference between these two equivalent modes of definition is, of course, only formal, and it will be a matter of convenience to decide which of them should be used in a given case.
— As in 6.6, $P(S)$ will be called the probability function, and $F(x)$ the distribution function of the distribution.

Thus a distribution function is a function $F(x) = F(x_1, \ldots, x_n)$ which, in each $x_\nu$, is non-decreasing and everywhere continuous to the right, and is such that the $n$:th difference $\Delta_n F$ as defined by (8.3.3) is always non-negative. Conversely, it follows from the preceding paragraph that any given $F$ with these properties is the distribution function of a uniquely determined distribution in $R_n$.

If the set which consists of the single point $x = a$ carries a positive quantity of mass, $a$ is a discrete mass point of the distribution. The set of all discrete mass points of a distribution is enumerable, as we find by a direct generalization of the corresponding proof in 6.2. Obviously any discrete mass point $a$ is a discontinuity point for the distribution function $F$. In the case $n = 1$ we have seen in 6.6 that, conversely, $F$ is continuous in all points $x$ except the discrete mass points. This is generally not true when $n > 1$. In fact, in a multi-dimensional space the mass may be distributed on lines, surfaces or hypersurfaces in such a way that there is no single point carrying a positive quantity of mass, while still $F$ may be discontinuous in certain points. In the preceding paragraph we have, however, seen that it is possible to exclude certain values for each variable so that the function $F$ will be continuous in all »non-excluded» points.

Consider e.g. a distribution of a mass unit with uniform density over the interval (0, 1) of the $x_2$-axis in the plane of the variables $x_1, x_2$. Obviously this distribution has no discrete mass points, and still the corresponding distribution function $F(x_1, x_2)$ is discontinuous in every point $(0, x_2)$ with $x_2 > 0$. Accordingly it will be seen that the function $F_1(x_1) = \lim_{x_2 \to +\infty} F(x_1, x_2)$ discussed in the preceding paragraph is here discontinuous for $x_1 = 0$, which is the only »excluded» value for $x_1$.
For $x_2$ there are no excluded values, and accordingly $F(x_1, x_2)$ is continuous in any point $(x_1, x_2)$ with $x_1 \ne 0$.

We further see that any distribution in $R_n$ can be uniquely represented in the form (6.6.2), as the sum of two components, the first of which corresponds to a distribution with its whole mass concentrated in discrete mass points, while the second component corresponds to a distribution without discrete mass points. It follows from the above that, when $n > 1$, we cannot assert that the distribution function of the second component is everywhere continuous.

Let $I$ denote the $n$-dimensional interval defined by $x_\nu < \xi_\nu \le x_\nu + h_\nu$ for $\nu = 1, 2, \ldots, n$. The ratio

$$\frac{P(I)}{L(I)} = \frac{\Delta_n F}{h_1 h_2 \cdots h_n},$$

where the difference $\Delta_n F$ is defined as in (8.3.3), represents the average density of the mass in the interval $I$. If the partial derivative

$$f(x_1, \ldots, x_n) = \frac{\partial^n F}{\partial x_1 \, \partial x_2 \cdots \partial x_n}$$

exists, the average density will tend to this value as all the $h_\nu$ tend to zero, and accordingly $f(x_1, \ldots, x_n)$ represents the density of mass at the point $x$. As in the one-dimensional case, this function will be called the probability density or frequency function of the distribution.

Let $F(x_1, \ldots, x_n)$ be the distribution function of a given distribution. When all the variables except $x_\nu$ tend to $+\infty$, $F$ will (cf 8.3) tend to a limit $F_\nu(x_\nu)$ which is a distribution function in $x_\nu$. We have, e.g., $F_1(x_1) = F(x_1, +\infty, \ldots, +\infty)$. The function $F_\nu(x_\nu)$ defines a one-dimensional distribution, which will be called the marginal distribution of $x_\nu$. We may obtain a concrete representation of this marginal distribution by allowing every mass particle in the original $n$-dimensional distribution to move in a direction perpendicular to the axis of $x_\nu$, until it arrives at a point of this axis. When, finally, the whole mass is in this way projected on the axis of $x_\nu$, a one-dimensional distribution is generated on the axis, and this is the marginal distribution of $x_\nu$.
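A small numerical sketch (ours, with a hypothetical example) of the density as a limit of average densities: for the distribution function $F(x_1, x_2) = (1 - e^{-x_1})(1 - e^{-x_2})$ on $x_1, x_2 > 0$, the frequency function is $f(x_1, x_2) = e^{-x_1 - x_2}$, and $\Delta_2 F / (h_1 h_2)$ over a shrinking rectangle approaches it.

```python
import math

# Sketch (our own example, not from the text): the average density
# Delta_2 F / (h * h) of a small rectangle tends to the frequency
# function f(x1, x2) = exp(-x1 - x2) as h -> 0.

def F(x1, x2):
    """Hypothetical distribution function, product of exponential CDFs."""
    return (1.0 - math.exp(-max(x1, 0.0))) * (1.0 - math.exp(-max(x2, 0.0)))

def average_density(x1, x2, h):
    """Mass of (x1, x1+h] x (x2, x2+h] divided by its area h*h."""
    mass = F(x1 + h, x2 + h) - F(x1, x2 + h) - F(x1 + h, x2) + F(x1, x2)
    return mass / (h * h)

exact = math.exp(-2.0)                       # f(1, 1)
approx = average_density(1.0, 1.0, 1e-4)     # close to exp(-2)
```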
Each variable $x_\nu$ has, of course, its own marginal distribution, which may be different from the marginal distributions of the other variables. — Let us now take any group of $k < n$ variables, say $x_1, \ldots, x_k$, and allow the $n - k$ remaining variables to tend to $+\infty$. Then $F$ will tend to a distribution function in $x_1, \ldots, x_k$, which defines the $k$-dimensional marginal distribution of this group of variables. The distribution may be concretely represented by a projection of the mass in the original $n$-dimensional distribution on the $k$-dimensional subspace (cf 3.5) of the variables $x_1, \ldots, x_k$. — Let $P$ be the probability function of the $n$-dimensional distribution, while $P_{1,\ldots,k}$ is the probability function of the marginal distribution of $x_1, \ldots, x_k$. Let, further, $S'$ denote any set in the $k$-dimensional subspace of $x_1, \ldots, x_k$, while $S$ is the cylinder set (cf 3.5) of all points $x$ in $R_n$ that are projected on the subspace in a point belonging to $S'$. Obviously we then have

(8.4.2) $\quad P_{1,\ldots,k}(S') = P(S),$

which is the analytical expression of the projection of the mass in the original $n$-dimensional distribution on the $k$-dimensional subspace of the variables $x_1, \ldots, x_k$. — The theory of distributions in $R_n$ will be further developed in Chs. 21 — 24.

8.5. Sequences of distributions. — As in the one-dimensional case (cf 6.7), we shall say that a sequence of distributions in $R_n$ is convergent, when the corresponding probability functions converge to a non-negative and additive set function $P(S)$, in every continuity interval of the latter. If, in addition, the limit $P(S)$ is a probability function, i.e. if $P(R_n) = 1$, we shall say that the sequence converges to a distribution. From the point of view of the applications, it is generally only the latter mode of convergence that is important.
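The projection relation (8.4.2) can be made concrete for a discrete distribution; the four-point distribution in $R^3$ below is a toy example of ours, not taken from the text.

```python
# Sketch (our own toy example): the projection relation (8.4.2) for a
# discrete distribution in R^3.  The marginal of (x1, x2) is obtained by
# projecting the mass, i.e. summing it over x3.

# Hypothetical distribution: mass attached to four points of R^3.
P = {(0, 0, 0): 0.1, (0, 1, 2): 0.4, (1, 1, 0): 0.3, (1, 0, 5): 0.2}

def marginal_12(P):
    """Probability function P_{1,2} of the marginal distribution of (x1, x2)."""
    m = {}
    for (x1, x2, x3), mass in P.items():
        m[(x1, x2)] = m.get((x1, x2), 0.0) + mass
    return m

def prob_cylinder(P, S_prime):
    """P(S) for the cylinder set S over the base S' in the (x1, x2) plane."""
    return sum(mass for (x1, x2, x3), mass in P.items() if (x1, x2) in S_prime)

S_prime = {(0, 1), (1, 1)}
lhs = sum(marginal_12(P).get(pt, 0.0) for pt in S_prime)   # P_{1,2}(S')
rhs = prob_cylinder(P, S_prime)                            # P(S)
```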
For a sequence which is convergent without converging to a distribution, we have $P(R_n) < 1$, which may be interpreted (cf the example discussed in 6.7) by saying that a certain part of the mass in our distributions »escapes towards infinity» when we pass to the limit. A straightforward generalization of 6.7 will show that a sequence of distributions converges to a distribution when and only when the corresponding distribution functions $F_1, F_2, \ldots$ tend to a distribution function $F$ in all »non-excluded» (cf 8.3) points of the latter. A further criterion for deciding whether a given sequence of distributions converges to a distribution or not will be given in 10.7.

As in 6.8, we shall further say that a sequence of distribution functions $F_1, F_2, \ldots$ is convergent, if there is a function $F$, non-decreasing in each $x_\nu$, such that $F_n \to F$ in every »non-excluded» point of $F$. We then always have $0 \le F \le 1$, but according to the above $F$ is not necessarily a distribution function. We then have the following generalization of the proposition proved in 6.8 for the one-dimensional case: Every sequence of distribution functions contains a convergent sub-sequence. — This may be proved by a fairly straightforward generalization of the proof in 6.8, and we shall not give the proof here.

8.6. Distributions in a product space. — Consider two spaces $R_m$ and $R_n$, with the variable points $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_n)$ respectively. Suppose that in each space a distribution is given, and let $P_1$ and $F_1$ denote the probability function and the distribution function of the distribution in $R_m$, while $P_2$ and $F_2$ have the analogous significance for the distribution in $R_n$. In the product space (cf 3.5) $R_m \cdot R_n$ of $m + n$ dimensions, we denote the variable point by $z = (x, y) = (x_1, \ldots, x_m, y_1, \ldots, y_n)$. If $S_1$ and $S_2$ are sets in $R_m$ and $R_n$ respectively, we denote by $S$ the rectangle set (cf 3.5) of all points $z = (x, y)$ in the product space such that $x \in S_1$ and $y \in S_2$.
It is almost evident that we can always find an infinite number of distributions in the product space, such that for each of them the marginal distributions (cf 8.4) corresponding to the subspaces $R_m$ and $R_n$ coincide with the two given distributions in these spaces. Among these distributions in the product space we shall particularly note one, which is of special importance for the applications. This is the distribution given by the following theorem.

There is one and only one distribution in the product space $R_m \cdot R_n$ such that

(8.6.1) $\quad P(S) = P_1(S_1)\, P_2(S_2)$

for all rectangle sets $S$ defined by the relations $x \in S_1$ and $y \in S_2$. This is the distribution defined by the distribution function

(8.6.2) $\quad F(z) = F_1(x)\, F_2(y)$

for all points $z = (x, y)$.

We first observe that $F(z)$ as given by (8.6.2) is certainly a distribution function in $R_m \cdot R_n$, since it satisfies the characteristic properties of a distribution function given in 8.4. Consider now the distribution defined by $F(z)$. By means of (8.3.3) it follows that we have $P(S) = P_1(I_1)\, P_2(I_2)$ for any half-open interval $S = (I_1, I_2)$ defined by inequalities of the type $a_\nu < x_\nu \le b_\nu$, $c_\nu < y_\nu \le d_\nu$. Now any Borel set $S_1$ may be formed from intervals $I_1$ by repetitions of the operations of addition and subtraction. (By (1.3.1), the operation of multiplication may be reduced to additions and subtractions.) By the additive property of $P_1$, it follows that for any rectangle set of the form $S = (S_1, I_2)$ we have

$$P(S) = P_1(S_1)\, P_2(I_2),$$

and finally we obtain (8.6.1) by operating in the same way on intervals $I_2$. — On the other hand, any distribution satisfying (8.6.1) also satisfies (8.6.2), the latter relation being, in fact, merely a particular case of the former. Since a distribution is uniquely determined by its distribution function, there can thus be only one distribution satisfying (8.6.1).
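A one-dimensional sketch of the theorem (our own choice of marginals, purely for illustration): with $F_1$ the uniform distribution function on (0, 1) and $F_2$ the exponential one, the product distribution function $F(x, y) = F_1(x) F_2(y)$ assigns to every rectangle $(a, b] \times (c, d]$ the product mass $P_1(I_1) P_2(I_2)$, as (8.6.1) and (8.6.2) require.

```python
import math

# Sketch (our own example): the product distribution (8.6.2) with m = n = 1
# and hypothetical marginals, checked against the factorization (8.6.1).

def F1(x):
    """Uniform distribution function on (0, 1)."""
    return min(max(x, 0.0), 1.0)

def F2(y):
    """Exponential distribution function."""
    return 1.0 - math.exp(-y) if y > 0 else 0.0

def F(x, y):
    """Distribution function of the product distribution, (8.6.2)."""
    return F1(x) * F2(y)

a, b, c, d = 0.2, 0.7, 0.5, 2.0
mass = F(b, d) - F(a, d) - F(b, c) + F(a, c)    # Delta_2 F over the rectangle
product = (F1(b) - F1(a)) * (F2(d) - F2(c))     # P1(I1) * P2(I2)
```

Letting the second argument tend to infinity recovers the first marginal, in agreement with the remark following the theorem.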
If, in (8.6.1), we put $S_2 = R_n$, it follows from (8.4.2) that the marginal distribution corresponding to the subspace $R_m$ coincides with the given distribution in this space, with the probability function $P_1$. Similarly, by putting $S_1 = R_m$, we find that the marginal distribution in $R_n$ coincides with the given distribution in this space.

We finally remark that the theorem may be generalized to distributions in the product space of any number of spaces. The proof is quite similar to the above, and the relations (8.6.1) and (8.6.2) are replaced by the obvious generalizations $P = P_1 P_2 \cdots P_k$ and $F = F_1 F_2 \cdots F_k$.

CHAPTER 9.

The Lebesgue-Stieltjes Integral for Functions of n Variables.

9.1. The Lebesgue-Stieltjes integral. — The theory of the Lebesgue-Stieltjes integral for functions of one variable developed in Ch. 7 may be directly generalized to functions of $n$ variables. If, in the expressions (7.1.1) of the Darboux sums, we allow $P(S)$ to denote a non-negative and additive set function in $R_n$, while $m_\nu$ and $M_\nu$ are the lower and upper bounds of a given function $g(x) = g(x_1, \ldots, x_n)$ in the $n$-dimensional set $S_\nu$, the Lebesgue-Stieltjes integral

$$\int_S g(x)\, dP = \int_S g(x_1, \ldots, x_n)\, dP$$

is defined in the same way as in the one-dimensional case. The function $g(x)$ is said to be $B$-measurable in the set $S$ if the subset of all points $x$ in $S$ such that $g(x) \le k$ is a Borel set for every real value of $k$. All remarks on $B$-measurable functions given in 5.2 extend themselves without difficulty to functions of $n$ variables. If $g(x)$ is bounded and $B$-measurable in a set $S$ of finite $P$-measure, it is integrable over $S$ with respect to $P$. The definitions of integral and integrability in the case of an unbounded function $g(x)$, and a set $S$ of infinite $P$-measure, require only a straightforward generalization of 7.2.
All properties of the integral mentioned in 7.1 — 7.3 readily extend themselves to the case of $n$ variables, all proofs being strictly analogous to those given in the case $n = 1$. In the particular case when $P(S)$ is the $n$-dimensional Lebesgue measure $L(S)$, we obtain the Lebesgue integral of the function $g(x)$, which is also often written in the ordinary multiple integral notation:

$$\int_S g(x)\, dL = \int_S g(x_1, \ldots, x_n)\, dx_1 \ldots dx_n.$$

If $S$ is an interval, and $g(x)$ is integrable in the Riemann sense over the interval, the Lebesgue integral coincides with the ordinary multiple Riemann integral, as we have observed for the one-dimensional case in 5.1.

9.2. Lebesgue-Stieltjes integrals with respect to a distribution. — The remarks made on this subject in 7.4 evidently apply also in the case $n > 1$. The moments of a distribution in $R_n$ are the integrals

$$\int_{R_n} x_1^{\nu_1} x_2^{\nu_2} \cdots x_n^{\nu_n}\, dP,$$

where the $\nu_i$ are non-negative integers. As in the one-dimensional case, we shall say that the above moment exists, whenever the function $x_1^{\nu_1} \cdots x_n^{\nu_n}$ is integrable over $R_n$ with respect to $P$.

We shall now consider the integral

(9.2.1) $\quad \int_{R_n} g(x_1, \ldots, x_k)\, dP$

in the case when the function $g$ only depends on a certain number of the variables, say $x_1, \ldots, x_k$, where $k < n$. We denote by $R_k$ the $k$-dimensional subspace of these variables. Let us first assume $g$ bounded, and consider the divisions $R_k = S_1' + \cdots + S_g'$, $R_n = S_1 + \cdots + S_g$, where the $S_\nu'$ are Borel sets in $R_k$ such that $S_\mu' S_\nu' = 0$ for $\mu \ne \nu$, while $S_\nu$ denotes the cylinder set (cf 3.5) in $R_n$ which has the base $S_\nu'$. The upper Darboux sum

$$Z = M_1 P(S_1) + \cdots + M_g P(S_g)$$

corresponding to the integral (9.2.1) is then by (8.4.2) identical with the sum

$$Z = M_1 P_{1,\ldots,k}(S_1') + \cdots + M_g P_{1,\ldots,k}(S_g'),$$

where $P_{1,\ldots,k}$ denotes the probability function of the marginal distribution of the variables $x_1, \ldots, x_k$. This is, however, the upper Darboux sum corresponding to the $k$-dimensional integral $\int_{R_k} g\, dP_{1,\ldots,k}$. As the same relation holds for the lower Darboux sums, it follows that we have for any bounded $g(x_1, \ldots, x_k)$

(9.2.2) $\quad \int_{R_n} g(x_1, \ldots, x_k)\, dP = \int_{R_k} g(x_1, \ldots, x_k)\, dP_{1,\ldots,k},$

so that in this case the $n$-dimensional integral reduces to a $k$-dimensional integral. It is easily seen that the same relation holds whenever $g$ is integrable over $R_k$ with respect to $P_{1,\ldots,k}$, even if $g$ is not bounded. We may also assume $g$ complex-valued.

9.3. A theorem on repeated integrals. — If $g(x, y)$ is continuous in the rectangle $a \le x \le b$, $c \le y \le d$, we know that the relation

$$\iint g\, dx\, dy = \int_a^b \Bigl( \int_c^d g(x, y)\, dy \Bigr) dx = \int_c^d \Bigl( \int_a^b g(x, y)\, dx \Bigr) dy$$

holds, so that the double integral can be expressed in two ways as a repeated integral. — There is a corresponding theorem for the Lebesgue-Stieltjes integral in any number of dimensions, and we shall now prove this theorem in a certain special case.

Using the same notations as in 8.6, we consider two probability functions $P_1$ and $P_2$ in the spaces $R_m$ and $R_n$ respectively, and the uniquely determined probability function $P$ in the product space $R_m \cdot R_n$ which satisfies (8.6.1). Let $S_1$ and $S_2$ denote given sets in $R_m$ and $R_n$ respectively, while $S = (S_1, S_2)$ is the rectangle set in the product space with the sides $S_1$ and $S_2$. Let further $g(x)$ and $h(y)$ be given point functions in $R_m$ and $R_n$ respectively, such that $g(x)$ is integrable over $S_1$ with respect to $P_1$, while $h(y)$ is integrable over $S_2$ with respect to $P_2$. Then $g(x) h(y)$ is integrable over $S = (S_1, S_2)$ with respect to $P$, and we have

(9.3.1) $\quad \int_S g(x) h(y)\, dP = \int_{S_1} g(x)\, dP_1 \cdot \int_{S_2} h(y)\, dP_2.$

Suppose first that $g(x)$ and $h(y)$ are bounded and non-negative. Consider the Darboux sums corresponding to the three integrals in (9.3.1), and to the divisions $S_1 = S_{11} + \cdots + S_{1p}$, $S_2 = S_{21} + \cdots + S_{2q}$, where $S_{ij}$ denotes the rectangle set $(S_{1i}, S_{2j})$. If these sums are denoted by $z$ and $Z$ for the integral in the first member, and by $z_1, Z_1$ and $z_2, Z_2$ for the two integrals in the second member, it is seen that we have $z_1 z_2 \le z \le Z \le Z_1 Z_2$. By the definition of the integral, (9.3.1) then follows immediately.
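For discrete distributions the factorization (9.3.1) reduces to elementary sums, which can be checked directly; the two marginals and the functions $g$, $h$ below are hypothetical examples of ours.

```python
# Sketch (our own toy example): the factorization (9.3.1) for discrete
# one-dimensional distributions P1 and P2 and their product distribution.

# Hypothetical marginals: mass attached to a few points of R1.
P1 = {-1.0: 0.25, 0.0: 0.25, 2.0: 0.5}
P2 = {1.0: 0.4, 3.0: 0.6}

g = lambda x: x * x          # integrable over R1 with respect to P1
h = lambda y: 1.0 / y        # integrable over R1 with respect to P2

# First member: integral of g(x)h(y) over the product distribution.
lhs = sum(g(x) * h(y) * px * py
          for x, px in P1.items() for y, py in P2.items())

# Second member: product of the two one-dimensional integrals.
rhs = (sum(g(x) * px for x, px in P1.items())
       * sum(h(y) * py for y, py in P2.items()))
```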
— Replacing further $g$ and $h$ by $g' - g''$ and $h' - h''$, where $g'$, $g''$, $h'$ and $h''$ are bounded and non-negative, we obtain (9.3.1) for any real and bounded $g$ and $h$. The extension to any integrable and complex-valued functions follows directly from the definition of the integral for these classes of functions.

9.4. The Riemann-Stieltjes integral. — The considerations of 7.5 may also be generalized to $n$ variables, where we have to employ the point function $F(x_1, \ldots, x_n)$ and the difference $\Delta_n F$ instead of the point function $F(x)$ and the difference $F(x_\nu) - F(x_{\nu-1})$. In particular it follows that, if a continuous derivative $f = \dfrac{\partial^n F}{\partial x_1 \ldots \partial x_n}$ exists for all points of the interval $I$ ($a_\nu \le x_\nu \le b_\nu$, $\nu = 1, \ldots, n$), and if $g(x)$ is continuous in $I$, then the integral (9.1.1) may, for $S = I$, be expressed as a multiple Riemann integral

$$\int_I g\, dP = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} g(x)\, f(x)\, dx_1 \ldots dx_n.$$

This property is immediately extended to the case of a complex-valued function $g(x)$, and also to infinite intervals, subject to the condition that $g(x)$ is integrable over $I$ with respect to $P$.

9.5. The Schwarz inequality. — Consider two real functions $g(x)$ and $h(x)$ such that the squares $g^2$ and $h^2$ are integrable with respect to $P$ over the set $S$ in $R_n$. The quadratic form in $u$ and $v$

$$\int_S \bigl( u\, g(x) + v\, h(x) \bigr)^2\, dP = u^2 \int_S g^2\, dP + 2uv \int_S gh\, dP + v^2 \int_S h^2\, dP$$

is non-negative for all real values of the variables $u$ and $v$. Thus (cf 11.10) the determinant of the form is non-negative, which implies that we have

(9.5.1) $\quad \Bigl( \int_S gh\, dP \Bigr)^2 \le \int_S g^2\, dP \cdot \int_S h^2\, dP.$

Chapters 10 — 12. Various Questions.

CHAPTER 10.

Fourier Integrals.

For the applications to probability theory and statistics, we shall require a certain number of theorems concerning some special classes of Fourier integrals, which will be deduced in this chapter. The general theory of the subject is treated e.g. in books by Bochner (Ref. 4), Titchmarsh (Ref. 38) and Wiener (Ref. 41).

10.1. The characteristic function of a distribution in $R_1$. — Let $F(x)$ denote a one-dimensional distribution function (cf 6.6), and $t$ a real number.
The function $g(x) = e^{itx} = \cos tx + i \sin tx$ is then, by 7.4, integrable over $(-\infty, \infty)$ with respect to $F(x)$, since $|e^{itx}| = 1$. The function of the real variable $t$

(10.1.1) $\quad \varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$

will be called the characteristic function of the distribution corresponding to $F(x)$. In general $\varphi(t)$ is a complex-valued function of $t$. Obviously we always have $\varphi(0) = 1$, and for all values of $t$

$$|\varphi(t)| \le \int_{-\infty}^{\infty} dF(x) = 1, \qquad \varphi(-t) = \bar{\varphi}(t),$$

writing $\bar{a}$ for the conjugated complex quantity of $a$. It further follows from 7.3 that $\varphi(t)$ is continuous for all real $t$.

If the moment $\alpha_k$ of order $k$ of the distribution (cf 7.4) exists, it follows from 7.3 that we may differentiate (10.1.1) $k$ times with respect to $t$, and thus obtain for $0 \le \nu \le k$

(10.1.2) $\quad \varphi^{(\nu)}(t) = i^\nu \int_{-\infty}^{\infty} x^\nu e^{itx}\, dF(x).$

Hence by 7.3 $\varphi^{(\nu)}(t)$ is continuous for all real $t$, and we have

$$\varphi^{(\nu)}(0) = i^\nu \int_{-\infty}^{\infty} x^\nu\, dF(x) = i^\nu \alpha_\nu.$$

In the neighbourhood of $t = 0$ we thus have a development in MacLaurin's series:

(10.1.3) $\quad \varphi(t) = 1 + \sum_{\nu=1}^{k} \frac{\alpha_\nu}{\nu!} (it)^\nu + o(t^k),$

where the error term, divided by $t^k$, tends to zero as $t \to 0$ (cf 12.1).

Conversely, if it is known that the characteristic function has, for the particular value $t = 0$, a finite derivative of even order $2k$, this derivative is equal to the limit

$$\varphi^{(2k)}(0) = \lim_{h \to 0} \frac{\Delta^{2k} \varphi(0)}{(2h)^{2k}} = (-1)^k \lim_{h \to 0} \int_{-\infty}^{\infty} \Bigl( \frac{\sin hx}{h} \Bigr)^{2k} dF(x).$$

For any finite interval $(a, b)$ we have, however, by (7.1.7),

$$\int_a^b x^{2k}\, dF(x) = \lim_{h \to 0} \int_a^b \Bigl( \frac{\sin hx}{h} \Bigr)^{2k} dF(x) \le |\varphi^{(2k)}(0)|.$$

It follows that the moment $\alpha_{2k}$ exists, and thus (10.1.2) holds for $0 \le \nu \le 2k$ and for all values of $t$.

We thus see that the differentiability properties of $\varphi(t)$ are related to the behaviour of $F(x)$ for large values of $|x|$, since it is this behaviour that decides whether the moments $\alpha_\nu$ exist or not. It can also be shown that, conversely, the behaviour of $\varphi(t)$ at infinity is related to the continuity and differentiability properties of $F(x)$. Suppose, e.g., that $F(x)$ is everywhere continuous, and that a continuous frequency function $f(x) = F'(x)$ exists for all $x$, except at most in a finite number of points. We then have by (7.5.5)

(10.1.4) $\quad \varphi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx,$

and it can be shown that $\varphi(t)$ tends to zero as $t \to \pm\infty$. If, moreover, the $n$:th derivative $f^{(n)}(x)$ exists for all $x$ and is such that $|f^{(n)}(x)|$ is integrable over $(-\infty, \infty)$, a repeated partial integration shows that we have

$$|\varphi(t)| < \frac{c}{|t|^n}$$

for all $t \ne 0$, where $c$ is a constant. We shall, however, not give a detailed proof of these properties here.

Suppose, on the other hand, that $F(x)$ is a step-function with steps of the height $p_\nu$ in the points $x = x_\nu$. We then have by (7.1.8)

(10.1.5) $\quad \varphi(t) = \sum_\nu p_\nu e^{itx_\nu},$

the series being absolutely and uniformly convergent for all $t$, since $\sum p_\nu = 1$. Each term of the series is a periodic function of $t$, and thus certainly does not tend to zero as $t \to \pm\infty$. It can be shown that also the sum of the series does not tend to zero as $t \to \pm\infty$. Thus e.g. the characteristic function of the distribution function $\varepsilon(x)$ defined by (6.7.1) is identically equal to 1.

Not every function $\varphi(t)$ may be the characteristic function of a distribution. Necessary conditions are, according to the above, that $\varphi(t)$ should be everywhere continuous and such that $|\varphi(t)| \le 1$, $\varphi(0) = 1$ and $\varphi(-t) = \bar{\varphi}(t)$. These conditions are, however, not sufficient. If, e.g., $\varphi(t)$ is near $t = 0$ of the form $\varphi(t) = 1 + O(|t|^{2+\delta})$, where $\delta > 0$, then it follows from (10.1.3) that the distribution corresponding to $\varphi(t)$ must have $\alpha_1 = \alpha_2 = 0$, which means (cf 16.1) that the whole mass of the distribution is concentrated in the point $x = 0$. This is, however, the distribution which has the distribution function $\varepsilon(x)$ and the characteristic function $\varphi(t) = 1$. Hence in this case $\varphi(t)$ cannot be a characteristic function unless it is identically equal to 1. Thus e.g. the functions $e^{-t^4}$ and $\dfrac{1}{1 + t^4}$ are no characteristic functions, though both satisfy the above necessary conditions.

Various necessary and sufficient conditions are known.
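A small numerical sketch (ours) of (10.1.5) and the expansion (10.1.3) with $k = 2$: for a hypothetical three-point distribution, $\varphi(t)$ is a finite sum of complex exponentials, and near $t = 0$ it agrees with $1 + i\alpha_1 t - \alpha_2 t^2/2$ up to a term of order $t^3$.

```python
import cmath

# Sketch (our own toy example): the characteristic function (10.1.5) of a
# discrete distribution, and the MacLaurin expansion (10.1.3) with k = 2.

# Hypothetical three-point distribution.
points = {-1.0: 0.2, 0.5: 0.5, 2.0: 0.3}

def phi(t):
    """Characteristic function: sum of p_v * exp(i t x_v)."""
    return sum(p * cmath.exp(1j * t * x) for x, p in points.items())

alpha1 = sum(p * x for x, p in points.items())        # first moment
alpha2 = sum(p * x * x for x, p in points.items())    # second moment

t = 0.01
approx = 1 + 1j * alpha1 * t - alpha2 * t * t / 2     # (10.1.3), k = 2
err = abs(phi(t) - approx)                            # of order t**3
```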
The simplest seem to be the following (Cramér, Ref. 71). In order that a given, bounded and continuous function $\varphi(t)$ should be the characteristic function of a distribution, it is necessary and sufficient that $\varphi(0) = 1$ and that the function

$$\psi(x, A) = \int_0^A \int_0^A \varphi(t - u)\, e^{i(t - u)x}\, dt\, du$$

is real and non-negative for all real $x$ and all $A > 0$.

That these conditions are necessary is easily shown. When $\varphi(t)$ is the characteristic function corresponding to the distribution function $F(x)$ we find, in fact,

$$\psi(x, A) = \int_{-\infty}^{\infty} \frac{2\,\bigl( 1 - \cos A(x + y) \bigr)}{(x + y)^2}\, dF(y),$$

and the last expression is evidently real and non-negative. — The proof that the conditions are sufficient depends on the properties of certain integrals analogous to those used in the two following paragraphs. It is, however, somewhat intricate and will not be given here.

10.2. Some auxiliary functions. — Consider the functions

$$s(h, T) = \frac{2}{\pi} \int_0^T \frac{\sin ht}{t}\, dt, \qquad c(h, T) = \frac{2}{\pi} \int_0^T \frac{1 - \cos ht}{t^2}\, dt,$$

where $h$ is real and $T > 0$. Obviously $c(h, T) \ge 0$, and

$$s(-h, T) = -s(h, T), \qquad c(-h, T) = c(h, T).$$

By simple transformations we obtain for $h > 0$

$$s(h, T) = \frac{2}{\pi} \int_0^{hT} \frac{\sin t}{t}\, dt, \qquad c(h, T) = \frac{2h}{\pi} \int_0^{hT} \frac{1 - \cos t}{t^2}\, dt.$$

Now it is proved in text-books on Integral Calculus that the integral

$$\int_0^x \frac{\sin t}{t}\, dt$$

is bounded for all $x > 0$ and tends to the limit $\pi/2$ as $x \to \infty$. It follows that $s(h, T)$ is bounded for all real $h$ and all $T > 0$ and that we have, uniformly for $|h| \ge \delta > 0$,

(10.2.1) $\quad \lim_{T \to \infty} s(h, T) = \begin{cases} 1 & \text{for } h > 0, \\ 0 & \text{for } h = 0, \\ -1 & \text{for } h < 0. \end{cases}$

We further obtain for all real $h$

(10.2.2) $\quad \lim_{T \to \infty} c(h, T) = \frac{2\,|h|}{\pi} \int_0^\infty \frac{1 - \cos t}{t^2}\, dt = |h|.$
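The limits (10.2.1) and (10.2.2) can be checked numerically; the crude midpoint quadrature below is our own device, and $T$ is simply taken large enough for the tails (of order $1/T$) to be negligible at the stated tolerance.

```python
import math

# Sketch (our own numerical check, not from the text): the limits (10.2.1)
# and (10.2.2) of the auxiliary functions s(h, T) and c(h, T), approximated
# by a midpoint rule on (0, T).

def s(h, T, steps=200000):
    dt = T / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += math.sin(h * t) / t * dt
    return 2.0 / math.pi * total

def c(h, T, steps=200000):
    dt = T / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += (1.0 - math.cos(h * t)) / (t * t) * dt
    return 2.0 / math.pi * total

# For large T, s(1, T) is near 1 and c(h, T) is near |h|.
s_pos = s(1.0, 200.0)
c_val = c(1.5, 2000.0)
```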
In fact, if two distributions have the same characteristic function, the theorem shows that the two distributions agree for every interval that is a continuity interval for both distributions. Then, by 6.7, the distri- butions are identical. In order to prove the theorem, we write = fe>'^dF(x). It -r -00 Now the modulus of the function is at most equal to /t, so that the conditions stated in 7.3 for the reversion of the order of integration are satisfied. Hence 00 r OD T J I d F[x) I d ^ ^ j <IFW) f~f ^ ^ (r—a) t dt - oc —T — 00 0 00 = I g{j',T)dF(x), — 00 where r r , . 2 /‘sin A/ , , irain(x — a + h)t g{r.T)~ I - coa (x ~ a) f dt ~ I — ~~ — dt 7t J t 7t J t f- 1 ^ ®‘n ^ = 1 s (a; _ a + ft, J) _ i _ a _ ft, i’). t Z / Thus by the preceding paragraph \g(x^T) \ is less than an absolute constant, and we have 93 10.3 lim g {x, T) = r-oo 0 for X < a — /i, \ x = a — A, 1 a — h < X < a ■¥ h, 1 » X = a + hy 0 X > a + h. We maj thus apply theorem (7.2.2) and so obtain, since F(x) is con- tinuous for a? — fl ± hy a+h lim J — f d F(x) = F(a + h) — F{a — h)y T-* 00 J a — h SO that (10.3.1) is proved. In the particular case when \q>(t)\ is intej^rable over ( — 00 , 00 ), it follows from (10.3.1) that we have 2h 2;rj ht as soon as F is continuous in the points ./* ± h. When h tends to zero, the function under the integral tends to while its modulus is dominated by the integrable function | (p [t) |. Thus we may apply (7.3.1), and find that the derivative — exists for all a?, and that we have 00 (10.3.2) f{x) - J (p [t) d t. Then /(./•) is the frequency function (cf 6.0) of the distribution, and it follows from 7.3 that f[.r) is continuous for all values of x. — We call attention to the mutual reciprocity between the relations (10.3.2) and (10.1.4). In order tiO determine t\.x') by means of (lO.Ibl) we must know ip{t) over the whole infinite interval ( — «, go). 
The knowledge of xp{t) over a finite interval is, in fact, not sufficient for a unique determination of F{x\ This follows from an example given by Gnedenko (Kef. 117) of two characteristic functions which agree over a finite interval without being identical for all t. We shall give a somewhat simpler example due to Khintchine. The two functions ( 1 - I M for 1 <1 -i. 1, 0 for I f I > 1 , tpiJ) = i + 4 / cos 7t t 71^ ens 3 TT f co.s bn t 94 103 are both characteristic functions, fpxit) is the characteristic function of the distribu- tion defined by the frequency function - cos X as may be seen by taking/^ = 1, F{x)^ B{x)aiid fp{i)= 1 in (10.3.3), while ^2(0 corresponds 2 to a distribution having the mass ^ placed in the point x — 0, and the mass -v • in n 7t the point a* = wa:, wher*.^ h == ± 1, ± 3, . . .. — By summation of the trigonometrical series for ^2(0 it. is seen that for |/| £ 1. For |#| > 1, on the other hand, (p\{t) is equal to zero, while. ^ 2(0 is periodical with the period 2. We now proceed to prove a formula which is closely related to (10.3.1), hut differs from it hy containing^ an absolutely converg^ent integral. In the following paragraph, this formula will find an im- portant application. — For any real a and h > 0 we have (10.3.3) h oo j [F{a -f -S') — F{a — z)]dz j' ^ ^e~''"gp(/) df. () — oo Transforming the integral in the second member in the same wav as in the proof of (10.3.1), the reversion of the order of integration is justified by means of 7.3, Denoting the second member of (10.3.3) by f/j, we then obtain 00 00 = dF(x) j ^ 5 ?^ hi ^tt{x-a) I — 00 — 00 2 r , . r 1 — cos /t f , , == - I (lt{.r) I cos (x — (i)t tft. — 00 I) In the same way as above it then follows from (10.2.2) a -h h \ \j' — a — h \ — 2 \ x — a \ . -L- — L.. 
_ ^d^(x) a-\ h ~ j ^ a~h Applying the formula of partial integration (7.5.7) to the last integral^ taken over each of the intervals [a — h,a) and (a,o4-ft) separately, it is finally seen that is identical with the expression in the first member of (10.3.3), so that this relation is proved. 95 10.4 10.4. Continuity theorem for characteristic functions in Ri. — We have seen in the preceding paragraph that there is a one-to-one cor- respondence between a distribution and its characteristic function g> (^). A distribution function F(x) is thus always uniquely determined by the corresponding characteristic function g) (f), and the transformation by which we pass from F(ar) to g [t\ or conversely, is always unique. We shall now prove a theorem which shows that, subject to certain con- ditions, this transformation is also continuous, so that the relations Fn[x)-*F{x) and q)n(t)-^ g>{t) are equivalent. This theorem is of the highest importance for the applications, since it affords a criterion which often permits us to decide whether a given sequence of distributions converges to a distribution or not. We have seen in 6.7 that a sequence of distributions converges to a distribution when and only when the corresponding sequence of distribution functions converges to a distribution function. In the applications it is, however, sometimes very difficult to investigate directly the convergence of a sequence of distribution functions, while the convergence problem for the corresponding sequence of charac- teristic functions may be comparatively easy to solve. In such situa- tions, we shall often have occasion to use the following theorem, which is due to Levy (Ref. 24, 25) and Cramer (Ref. 11). We are given a sequence of distributions, with the distribution func- tions F^[jii), F^{x), . . ., and the characteristic functions (px{t), (fi[t), . . . 
A necessary and sufficient condition for the convergence of the sequence {Fn(x)] to a distribution function F[x) is that, for every t, the sequence {g?n(^)} converges to a limit which is continuous for the special value ^ “ 0. When this condition is satisfied, the limit q>[t) is identical with the characteristic function of the limiting distribution function F (.r). We shall first show that the condition is necessary, and that the limit q){t) is the characteristic function of F[x), This is, in fact, an immediate corollary of (7.5.9), since the conditions of this relation are evidently satisfied if we take g(x)=^e*^^. The main difficulty lies in the proof that the condition is sufficient. We then assume that g)n[t) tends for every Mo a limit (p[t) which is continuous for ^ 0, and we shall prove that under this hypothesis F'n[x) tends to a distribution function F(x). If this is proved, it follows from the first part of the theorem that the limit (p[t) is identical with the characteristic function of F(x), By 6.8 the sequence {Fn(a;)} contains a sub-sequence {7^n„(^)} con- 96 10.4 rergent to a non-decreasing function F[x\ where F(x) may be de- termined so as to be everywhere continuous to the right. We shall first prove that F[x) is a distribution function. As we obviously have 0 ^ F{x) ^ 1, it is sufficient to prove that oo) — F(“- oo) = 1. From (10.3.3) we obtain, putting a = 0, A 0 00 J W dz-j = q>n, («) dt. 0 —A —OP On both sides of this relation, we may allow v to tend to infinity under the integrals. In fact, the integrals on the left are taken over finite intervals, where Fn^ is uniformly bounded and tends almost everywhere to F^ so that we may apply (5.3.6). On the right, the modulus of the function under the integral is dominated by the function ^ which is integrable over (—00,00), bo that we may apply the more general theorem (5.5.2). We thus obtain, dividing by A, 0 —A —00 — COS h t g>(t) dt oo — coat lt\ —3 (p\-\dt. 
In this relation, we now allow h to assume a sequence of values tending to infinity. The first member then obviously tends to JP(+ Qo) — ii^(— oo). On the other hand, g>(t) is continuous for f = 0, so that 9^1^! tends for every t to the limit 5p(0). We have, however, 5P (0) = lim gpn (0), but 5Pn(0) = l for every «, since q>n{t) is a charac- n-*-ao teristic function. Hence 9(0 )=!. Applying once more (5.5.2), we thus obtain from the last integral, using (10.2.2), F(+ co)-F(-co)=lJ^—^*dt=l. — 00 Thus we must have 2^(-H qo)= 5-1, F ( — oo) = 0, and the limit jP (re) of the sequence {-Fn„(^)} is a distribution function. — By the first part of the proof, it then follows that the limit f>(t) of the sequence {q>n^[t)} is identical with the characteristic function of F(x), 7 — 454 H. Cramir 97 10.4 Consider now another convergent sub-seqaence of {Fn(a;)}, and denote the limit of the new sub-sequence by F^(x), always assuming this function to be determined so as to be everywhere continuous to the right. In the same way as before, it is then shown that F'^[x) is a distribution function. By hypothesis the characteristic functions of the new sub-sequence have, however, for all values of t the same limit g) (t) as before, so that g (t) is the characteristic function of both F(x) and F*(x). Then according to the uniqueness theorem (10.3.1) we have F(x) == F* (x) for all x. Thus every convergent sub-sequence of {7^n(a?)} has the same limit F(x). This is, however, equivalent to the statement that the sequence converges to F(x)j and since we have shown that is a distribution function, our theorem is proved. We know from 10.1 that a characteristic function is always continuous for every t Thus it follows from the above theorem that, as soon as the limit ^ (f) of a sequence of characteristic functions is continuous for the special value t = 0, it is continuous for every t. 
The condition that the limit should be continuous for the special value $t = 0$ is, however, essential for the truth of the theorem. We shall, in fact, show by an example that the theorem is not true if this condition is omitted. Let $F_n(x)$ be the distribution function defined by

$$F_n(x) = \begin{cases} 0 & \text{for } x \le -n, \\[4pt] \dfrac{x + n}{2n} & \text{for } -n < x < n, \\[4pt] 1 & \text{for } x \ge n. \end{cases}$$

The corresponding frequency function is constant equal to $\frac{1}{2n}$ in the interval $(-n, n)$, and disappears outside that interval. The corresponding characteristic function is by (10.1.4)

$$\varphi_n(t) = \frac{1}{2n}\int_{-n}^{n} e^{itx}\,dx = \frac{\sin nt}{nt}.$$

As $n$ tends to infinity, $\varphi_n(t)$ converges for every $t$ to the limit $\varphi(t)$ defined by

$$\varphi(t) = \begin{cases} 1 & \text{for } t = 0, \\ 0 & \text{for } t \ne 0. \end{cases}$$

Thus the limit is not continuous for $t = 0$. Accordingly, for every fixed $x$ we have $F_n(x) \to \tfrac{1}{2}$, so that the limit of $F_n(x)$ is not a distribution function. In the case $F_n(x) = \varepsilon(x - n)$ considered in 6.7, we have $\varphi_n(t) = e^{int}$, so that the sequence of characteristic functions is never convergent, except when $t$ is a multiple of $2\pi$. Accordingly, for every fixed $x$ we have $F_n(x) \to 0$, so that the limit of $F_n(x)$ is not a distribution function, as we have already seen in 6.7.

10.5. Some particular integrals. — We shall now deduce some formulae that will be used in the sequel. The integral

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$

is given in text-books on Integral Calculus. Substituting $x\sqrt{h}$ for $x$, we obtain for $h > 0$

$$\int_{-\infty}^{\infty} e^{-hx^2}\,dx = \sqrt{\frac{\pi}{h}}.$$

By means of 7.3 it is easily seen that we may differentiate any number of times with respect to $h$ under the integral, so that

$$\int_{-\infty}^{\infty} x^{2\nu} e^{-hx^2}\,dx = \frac{1\cdot 3\cdots(2\nu - 1)}{(2h)^\nu}\sqrt{\frac{\pi}{h}} \qquad (\nu = 0, 1, 2, \ldots). \tag{10.5.1}$$

Consider now the integral

$$\int_{-\infty}^{\infty} e^{itx - hx^2}\,dx = \int_{-\infty}^{\infty} e^{-hx^2}\sum_{\nu=0}^{\infty}\frac{(itx)^\nu}{\nu!}\,dx.$$

The partial sums of the series under the last integral are dominated by the function $e^{|tx| - hx^2}$, which is integrable over $(-\infty, \infty)$. Thus by (5.5.2) we may integrate the series term by term and so obtain, since all terms of odd order evidently vanish,

$$\int_{-\infty}^{\infty} e^{itx - hx^2}\,dx = \sum_{\nu=0}^{\infty}\frac{(it)^{2\nu}}{(2\nu)!}\int_{-\infty}^{\infty} x^{2\nu} e^{-hx^2}\,dx = \sqrt{\frac{\pi}{h}}\sum_{\nu=0}^{\infty}\frac{1}{\nu!}\left(-\frac{t^2}{4h}\right)^{\!\nu} = \sqrt{\frac{\pi}{h}}\;e^{-\frac{t^2}{4h}}. \tag{10.5.2}$$
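As a quick sanity check (an addition, not part of the original text), formula (10.5.2) can be verified numerically; the sketch below uses Python with numpy and a plain Riemann sum, with the particular values $h = 0.3$, $t = 1.2$ chosen only for illustration.

```python
import numpy as np

# Check (10.5.2): the integral of exp(itx - h*x^2) dx over the real line
# equals sqrt(pi/h) * exp(-t^2/(4h)); here h = 0.3 and t = 1.2.
x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
h, t = 0.3, 1.2
numeric = (np.sum(np.exp(1j * t * x - h * x**2)) * dx).real
exact = np.sqrt(np.pi / h) * np.exp(-t**2 / (4 * h))
print(numeric, exact)   # the two values agree closely
```

The truncation to $(-30, 30)$ is harmless here, since $e^{-0.3 x^2}$ is negligible beyond that range.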
Taking here $h = \tfrac{1}{2}$, and introducing the function

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt, \tag{10.5.3}$$

it follows that we have

$$\int_{-\infty}^{\infty} e^{itx}\,d\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx - x^2/2}\,dx = e^{-t^2/2}. \tag{10.5.4}$$

Now (10.5.3) shows that $\Phi(x)$ is a non-decreasing and everywhere continuous function, such that $\Phi(-\infty) = 0$ and $\Phi(+\infty) = 1$. Thus $\Phi(x)$ is a distribution function, and then (10.5.4) shows that the corresponding characteristic function is $e^{-t^2/2}$. The distribution determined by $\Phi(x)$ is the important normal distribution, that will be treated in Ch. 17. By repeated partial integration, we obtain from (10.5.4) the relation

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx}\,\frac{d^\nu}{dx^\nu}\bigl(e^{-x^2/2}\bigr)\,dx = (-it)^\nu\,e^{-t^2/2}. \tag{10.5.5}$$

We shall further consider the integral

$$\frac{1}{2}\int_{-\infty}^{\infty} e^{itx}\,e^{-|x|}\,dx = \int_0^{\infty}\cos tx \cdot e^{-x}\,dx = \left[e^{-x}\,\frac{t\sin tx - \cos tx}{1 + t^2}\right]_0^{\infty} = \frac{1}{1 + t^2}. \tag{10.5.6}$$

This expression may be regarded as the characteristic function corresponding to the frequency function $f(x) = \tfrac{1}{2}e^{-|x|}$. Since the characteristic function is integrable over $(-\infty, \infty)$, we obtain from (10.3.2) the reciprocal formula

$$\frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx}}{1 + t^2}\,dt = \tfrac{1}{2}\,e^{-|x|}. \tag{10.5.7}$$

10.6. The characteristic function of a distribution in $R_n$. — If $t = (t_1, \ldots, t_n)$ and $x = (x_1, \ldots, x_n)$ are considered as column vectors (cf. 11.2) corresponding to points in $R_n$, we denote by $t'x$ the product formed according to the rule (11.2.1) of vector multiplication:

$$t'x = t_1 x_1 + \cdots + t_n x_n.$$

The definition (10.1.1) of the characteristic function of a one-dimensional distribution is then generalized by writing

$$\varphi(t) = \varphi(t_1, \ldots, t_n) = \int_{R_n} e^{it'x}\,dP, \tag{10.6.1}$$

where $P = P(S)$ is the probability function of a distribution in $R_n$. The characteristic function $\varphi(t)$ of the distribution is thus a function of the $n$ real variables $t_1, \ldots, t_n$. Obviously we always have $\varphi(0, \ldots, 0) = 1$, and for all values of the variables

$$|\varphi(t)| \le 1, \qquad \varphi(-t) = \overline{\varphi(t)}.$$

Further, $\varphi(t)$ is everywhere continuous. If all moments of the distribution (cf. 9.2) up to a certain order exist, we have in the neighbourhood of the point $t = 0$ an expansion of $\varphi(t)$ analogous to (10.1.3).
The following theorem, which is a direct generalization of the uniqueness theorem (10.3.1), shows that a distribution in $R_n$ is uniquely determined by its characteristic function. If the interval $I$ defined by the inequalities $a_\nu < x_\nu < b_\nu$ $(\nu = 1, \ldots, n)$ is a continuity interval (cf. 8.3) of $P(S)$, we have

$$P(I) = \lim_{T\to\infty}\frac{1}{(2\pi)^n}\int_{-T}^{T}\cdots\int_{-T}^{T}\prod_{\nu=1}^{n}\frac{e^{-it_\nu a_\nu} - e^{-it_\nu b_\nu}}{it_\nu}\;\varphi(t_1, \ldots, t_n)\,dt_1\cdots dt_n. \tag{10.6.2}$$

The proof of this theorem is a straightforward generalization of the proof of (10.3.1). In the particular case when $|\varphi(t)|$ is integrable over $R_n$, we find as in (10.3.2) that the frequency function (cf. 8.4)

$$\frac{\partial^n F}{\partial x_1\cdots\partial x_n} = f(x_1, \ldots, x_n) = f(x)$$

exists and is continuous for all $x$, and that we have

$$f(x) = \frac{1}{(2\pi)^n}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-it'x}\,\varphi(t)\,dt_1\cdots dt_n. \tag{10.6.3}$$

The reciprocal formula corresponding to (10.1.4),

$$\varphi(t) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'x} f(x)\,dx_1\cdots dx_n, \tag{10.6.4}$$

is obtained from (10.6.1) and holds whenever the frequency function $f(x)$ exists and is continuous, except possibly in certain points belonging to a finite number of hypersurfaces in $R_n$.

We shall also want the following generalization of the theorem (10.3.3), which is proved in the same way as in the one-dimensional case. Let $I_{z_1\ldots z_n}$ denote the interval defined by the inequalities $a_\nu - z_\nu < x_\nu < a_\nu + z_\nu$ $(\nu = 1, \ldots, n)$. For any real $a_\nu$ and positive $h_\nu$ we have

$$\int_0^{h_1}\cdots\int_0^{h_n} P(I_{z_1\ldots z_n})\,dz_1\cdots dz_n = \frac{1}{\pi^n}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\prod_{\nu=1}^{n}\frac{1 - \cos h_\nu t_\nu}{t_\nu^2}\;e^{-it'a}\,\varphi(t)\,dt_1\cdots dt_n. \tag{10.6.5}$$

10.7. Continuity theorem for characteristic functions in $R_n$. — The continuity theorem proved in 10.4 may be directly generalized to multi-dimensional distributions. By 8.5, a sequence of distributions in $R_n$ converges to a distribution when and only when the corresponding distribution functions converge to a distribution function. As in the one-dimensional case, it is often easier in the applications to solve the convergence problem for the corresponding sequence of characteristic functions, and in such situations the following theorem will be useful.
We are given a sequence of distributions in $R_n$, with the distribution functions $F_1(x), F_2(x), \ldots$, and the characteristic functions $\varphi_1(t), \varphi_2(t), \ldots$. A necessary and sufficient condition for the convergence of the sequence $\{F_n(x)\}$ to a distribution function $F(x)$ is that, for every $t$, the sequence $\{\varphi_n(t)\}$ converges to a limit $\varphi(t)$ which is continuous at the special point $t = 0$. When this condition is satisfied, the limit $\varphi(t)$ is identical with the characteristic function of the limiting distribution function $F(x)$.

The proof that the condition is necessary is quite similar to the corresponding part of the proof in 10.4, and uses the generalization of (7.5.9) to integrals in $R_n$ (cf. 9.4). It then also follows that the limit $\varphi(t)$ is the characteristic function of $F(x)$. In order to prove that the condition is sufficient, we consider a sub-sequence $\{F_{n_\mu}(x)\}$ which converges (cf. 8.5) to a limit $F(x) = F(x_1, \ldots, x_n)$ that is non-decreasing and continuous to the right in each variable $x_\nu$. We want to show that $F(x)$ is a distribution function, i.e. that the corresponding non-negative and additive set function $P(S)$ is a probability function. For this purpose, it is sufficient to show that we have $P(R_n) = 1$. We then apply (10.6.5) to each $F_{n_\mu}$, putting all the $a_\nu = 0$. When $\mu$ tends to infinity, we obtain by the same argument as in 10.4

$$\frac{1}{h_1\cdots h_n}\int_0^{h_1}\cdots\int_0^{h_n} P(I_{z_1\ldots z_n})\,dz_1\cdots dz_n = \frac{1}{\pi^n}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\prod_{\nu=1}^{n}\frac{1 - \cos t_\nu}{t_\nu^2}\;\varphi\!\left(\frac{t_1}{h_1}, \ldots, \frac{t_n}{h_n}\right)dt_1\cdots dt_n.$$

Allowing the $h_\nu$ to tend to infinity, we then obtain, in perfect analogy with the one-dimensional case,

$$P(R_n) = \prod_{\nu=1}^{n}\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{1 - \cos t_\nu}{t_\nu^2}\,dt_\nu = 1,$$

so that the limit $P(S)$ of the sequence $\{P_{n_\mu}(S)\}$ is a probability function. The proof is then completed in the same way as in 10.4.

CHAPTER 11.

Matrices, Determinants and Quadratic Forms.

The subject of the present chapter is treated in several text-books in an elementary form well adapted for our purpose. We refer particularly to Aitken (Ref. 1), Bôcher (Ref. 3), and for Scandinavian readers to Bohr-Mollerup (Ref. 5).
We shall here restrict ourselves to giving, for the convenience of the reader, a brief survey, in many cases without complete proofs, of some fundamental definitions and properties that will be used in the sequel, adding full proofs of certain special theorems not contained in the text-books.

11.1. Matrices. — A matrix $A$ of order $m \cdot n$ is a rectangular scheme of numbers or elements $a_{ik}$ arranged in $m$ rows and $n$ columns:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}.$$

We write briefly $A = \{a_{ik}\}$, and when we want to emphasize the order of the matrix, we write $A_{mn}$ instead of $A$. We shall always assume that the elements $a_{ik}$ are real numbers.

In the particular case when $m = n = 1$, the matrix $A$ consists of one single element $a_{11}$, and we shall then identify the matrix with the ordinary number $a_{11}$.

Two matrices $A$ and $B$ are called equal, and we write $A = B$, when and only when $A$ and $B$ are of the same order, and all corresponding elements are equal: $a_{ik} = b_{ik}$ for all $i$ and $k$. We shall now define three kinds of operations with matrices:

1. The product of a matrix $A$ and an ordinary number $c$ is defined as the matrix obtained by multiplying every element of $A$ by $c$. Thus $cA = Ac = B$, where the elements of $B$ are $b_{ik} = c\,a_{ik}$. When $c = -1$, we write $-A$ instead of $(-1)A$.

2. The sum of two matrices $A$ and $B$ is only defined when the two matrices are of the same order. Then the sum $C = A + B$ is defined as a matrix of the same order with the elements $c_{ik} = a_{ik} + b_{ik}$.

3. The product of two matrices $A$ and $B$ is only defined when the first factor $A$ is of order $m \cdot r$, and the second factor $B$ is of order $r \cdot n$, so that the number of columns of the first factor agrees with the number of rows of the second factor.
Then the product $C = AB$, or $C_{mn} = A_{mr}B_{rn}$, is defined as a matrix of order $m \cdot n$, with elements $c_{ik}$ given by the expression

$$c_{ik} = \sum_{j=1}^{r} a_{ij} b_{jk}.$$

The element in the $i$:th row and $k$:th column of the product matrix is thus the sum of all products of corresponding elements from the $i$:th row of the first factor and the $k$:th column of the second factor.

The three matrix operations thus defined are associative and distributive. Moreover, the first two operations are commutative, while generally the third is non-commutative. Thus we have, e.g.,

$$(A + B) + C = A + (B + C), \qquad A + B = B + A,$$
$$C(A + B) = CA + CB, \qquad (A + B)C = AC + BC,$$
$$(AB)C = A(BC), \qquad c(A + B) = cA + cB,$$

but generally not $AB = BA$. Even if both products $AB$ and $BA$ are defined, they may be unequal. We are thus obliged to distinguish between premultiplication and postmultiplication: $AB$ means $A$ postmultiplied by $B$, or $B$ premultiplied by $A$.

From these properties it follows, e.g., that a linear combination $c_1 A_1 + \cdots + c_p A_p$ is uniquely defined as soon as all the $A_i$ are of the same order, and that the terms may be arbitrarily rearranged. Similarly, the product $D_{mn} = A_{mr}B_{rs}C_{sn}$ is uniquely defined, but here no rearrangement of the factors is allowed. The elements $d_{ik}$ of $D$ are given by the expression

$$d_{ik} = \sum_{j=1}^{r}\sum_{l=1}^{s} a_{ij} b_{jl} c_{lk}.$$

The transpose of a matrix $A = \{a_{ik}\}$ of order $m \cdot n$ is a matrix $A' = \{a'_{ik}\}$ of order $n \cdot m$, such that $a'_{ik} = a_{ki}$. Thus the rows of $A'$ are the columns of $A$, while the columns of $A'$ are the rows of $A$. Obviously we have

$$(A')' = A, \qquad (A + B)' = A' + B', \qquad (AB)' = B'A'.$$

Any matrix obtained by deleting one or more of the rows and columns of $A$ is called a submatrix of $A$. In particular every element of $A$ is a submatrix of order $1 \cdot 1$, while the rows and columns are submatrices of order $1 \cdot n$ and $m \cdot 1$ respectively.

When $m = n$, we shall call $A$ a square matrix. Owing to the associative property of matrix multiplication, the powers $A^2, A^3, \ldots$ of a square matrix are defined without ambiguity.
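These rules can be checked concretely; the following short sketch (an addition, not part of the original text) uses Python with numpy, where `@` denotes matrix multiplication, and small matrices chosen only for illustration.

```python
import numpy as np

# The matrix laws of 11.1: addition commutes, multiplication is
# associative and distributive, but AB and BA are generally unequal,
# and the transposition rule (AB)' = B'A' holds.
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
C = np.array([[2., 0.], [1., 1.]])

assert np.array_equal(A + B, B + A)                 # commutative addition
assert np.array_equal((A @ B) @ C, A @ (B @ C))     # associative product
assert np.array_equal(A @ (B + C), A @ B + A @ C)   # distributive law
assert np.array_equal((A @ B).T, B.T @ A.T)         # transpose of a product
print(A @ B)   # A postmultiplied by B
print(B @ A)   # a different matrix: the order of the factors matters
```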
The elements $a_{11}, a_{22}, \ldots, a_{nn}$ of a square matrix form the main or principal diagonal of the matrix, and are called the diagonal elements. A square matrix which is symmetrical about its main diagonal is called a symmetric matrix. A symmetric matrix is identical with its transpose, so that we have $A' = A$, or $a_{ki} = a_{ik}$. For an arbitrary matrix $A = A_{mn}$ it will be seen that the products $AA'$ and $A'A$ are symmetric, and of order $m \cdot m$ and $n \cdot n$ respectively.

A symmetric matrix with all its non-diagonal elements equal to zero is called a diagonal matrix. If $A_{mn}$ is an arbitrary matrix, and if $D_{mm}$ and $D_{nn}$ are diagonal matrices, the product $D_{mm}A_{mn}$ is obtained by multiplying the rows of $A$ by the corresponding diagonal elements of $D$, while the product $A_{mn}D_{nn}$ is obtained by multiplying the columns of $A$ by the corresponding diagonal elements of $D$.

A unit matrix $I$ is a diagonal matrix with all its diagonal elements equal to 1. For any matrix $A = A_{mn}$ we have $IA = AI = A$, where $I$ denotes the unit matrix of order $m \cdot m$ in the first product, and of order $n \cdot n$ in the second. A matrix (not necessarily square) having all its elements equal to zero is called a zero matrix, and is denoted by $0$.

11.2. Vectors. — A vector is a matrix consisting of one single row or one single column, and is called a row vector or a column vector, as the case may be. Thus a row vector $x = \{x_1, \ldots, x_n\}$ is a matrix of order $1 \cdot n$, while a column vector

$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

is of order $n \cdot 1$. In order to simplify the writing we shall, however, usually write the latter vector in the form $x = (x_1, \ldots, x_n)$, indicating by the use of ordinary instead of curled brackets that the vector is to be conceived as a column vector. The majority of vectors occurring in the applications will be of this kind.

The transpose of the column vector $x = (x_1, \ldots, x_n)$ is the row vector $x' = \{x_1, \ldots, x_n\}$, and conversely. If $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ are two column vectors, the product $x'y$ is a matrix of order $1 \cdot 1$, i.e.
an ordinary number:

$$x'y = x_1 y_1 + \cdots + x_n y_n. \tag{11.2.1}$$

In particular for $x = y$ we have

$$x'x = x_1^2 + \cdots + x_n^2.$$

The products $xy'$ and $xx'$, on the other hand, are not ordinary numbers, but matrices of order $n \cdot n$.

The vectors $x_1, \ldots, x_p$ are said to be linearly dependent, if a relation of the form $c_1 x_1 + \cdots + c_p x_p = 0$ exists, where the $c_i$ are ordinary numbers which are not all equal to zero. Otherwise $x_1, \ldots, x_p$ are linearly independent. Similarly, functions $f_1, \ldots, f_p$ of one or more variables are said to be linearly dependent, if a relation $c_1 f_1 + \cdots + c_p f_p = 0$, where the $c_i$ are constants not all $= 0$, holds for all values of the variables. When several linear relations of this form exist, these are called independent, if the corresponding vectors $c = (c_1, \ldots, c_p)$ are linearly independent.

11.3. Matrix notation for linear transformations. — A linear transformation

$$\begin{aligned} x_1 &= a_{11}y_1 + a_{12}y_2 + \cdots + a_{1n}y_n, \\ x_2 &= a_{21}y_1 + a_{22}y_2 + \cdots + a_{2n}y_n, \\ &\;\;\vdots \\ x_m &= a_{m1}y_1 + a_{m2}y_2 + \cdots + a_{mn}y_n, \end{aligned} \tag{11.3.1}$$

establishes a relation between two sets of variables, $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$, where $m$ is not necessarily equal to $n$. The matrix $A = A_{mn} = \{a_{ik}\}$ is the transformation matrix. Now if $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_n)$ are conceived as column vectors, the right-hand sides of the equations (11.3.1) are the elements of the product matrix $Ay$, which is of order $m \cdot 1$, i.e. a column vector. Thus (11.3.1) expresses that the corresponding elements of the column vectors $x$ and $Ay$ are equal, so that in matrix notation the transformation (11.3.1) takes the simple form $x = Ay$.

11.4. Matrix notation for bilinear and quadratic forms. — In the column vectors $x$ and $y$ of the preceding paragraph, we now consider the $x_i$ and $y_k$ as two sets of independent variables, and form the product matrix $x'Ay$, where $A = A_{mn} = \{a_{ik}\}$. This is a matrix of order $1 \cdot 1$, i.e. an ordinary number, and we find

$$x'Ay = \sum_{i,k} a_{ik}\,x_i y_k, \tag{11.4.1}$$

where $i = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, n$.
Thus the bilinear form in the variables $x_i$ and $y_k$ that appears here in the second member has a simple expression in matrix notation. In the important particular case when $m = n$, $x = y$ and $A$ is symmetric, the bilinear form (11.4.1) becomes

$$x'Ax = \sum_{i,k=1}^{n} a_{ik}\,x_i x_k, \tag{11.4.2}$$

where $a_{ki} = a_{ik}$. This expression is called a quadratic form in the variables $x_1, \ldots, x_n$, and will often be denoted by $Q(x)$ or $Q(x_1, \ldots, x_n)$. In matrix notation, we thus have $Q(x) = x'Ax$. The symmetric matrix $A$ is called the matrix of the form $Q$. If, in particular, $A = I$, we have $Q = x'Ix = x'x = x_1^2 + \cdots + x_n^2$.

The matrix expressions (11.4.1) and (11.4.2) are particularly well adapted for the study of linear transformations of bilinear and quadratic forms. Thus if, in the quadratic form $Q(x_1, \ldots, x_n) = \sum_{i,k=1}^{n} a_{ik}x_i x_k$, new variables are introduced by the linear transformation $x = Cy$, where $C = C_{nm}$, the result is a quadratic form $Q_1(y_1, \ldots, y_m)$ in the new variables:

$$Q(x_1, \ldots, x_n) = Q_1(y_1, \ldots, y_m) = \sum_{i,k=1}^{m} b_{ik}\,y_i y_k,$$

and the matrix expression (11.4.2) then immediately gives

$$Q = x'Ax = y'C'ACy = y'By, \qquad B = C'AC.$$

By transposition it is seen that this is a symmetric matrix, and thus the matrix of the transformed form is $C'AC$. The order is, of course, $m \cdot m$.

11.5. Determinants. — To every square matrix $A = A_{nn} = \{a_{ik}\}$ corresponds a number $A$ known as the determinant of the matrix, which is denoted

$$A = |A| = |a_{ik}| = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}.$$

The determinant is defined as the sum

$$A = \sum \pm\,a_{1r_1}a_{2r_2}\cdots a_{nr_n},$$

where the second subscripts $r_1, \ldots, r_n$ run through all the $n!$ possible permutations of the numbers $1, 2, \ldots, n$, while the sign of each term is $+$ or $-$ according as the corresponding permutation is even or odd. The number $n$ is called the order of the determinant.

The determinants of a square matrix $A$ and of its transpose $A'$ are equal: $A = A'$. If two rows or two columns in $A$ are interchanged, the determinant changes its sign.
Hence if two rows or two columns in $A$ are identical, the determinant is zero. If $B$ and $C$ are square matrices such that $AB = C$, the corresponding determinants satisfy the relation $A \cdot B = C$.

When $A$ is an arbitrary matrix (not necessarily square), the determinant of any square submatrix of $A$ is called a minor of $A$. When $A$ is square, a principal minor is a minor, the diagonal elements of which are diagonal elements of $A$.

In a square matrix $A = \{a_{ik}\}$, the cofactor $A_{ik}$ of the element $a_{ik}$ is the particular minor obtained by deleting the $i$:th row and the $k$:th column, multiplied by $(-1)^{i+k}$. We have the important identities

$$\sum_{j=1}^{n} a_{ij}A_{kj} = \begin{cases} A & \text{for } i = k, \\ 0 & \text{for } i \ne k, \end{cases} \tag{11.5.1}$$

$$\sum_{j=1}^{n} a_{ji}A_{jk} = \begin{cases} A & \text{for } i = k, \\ 0 & \text{for } i \ne k, \end{cases} \tag{11.5.2}$$

and further

$$A = a_{11}A_{11} - \sum_{i,k=2}^{n} A_{11,ik}\,a_{1i}a_{k1}, \tag{11.5.3}$$

where $A_{11,ik}$ is the cofactor of $a_{ik}$ in $A_{11}$.

11.6. Rank. — The rank of a matrix $A$ (not necessarily square) is the greatest integer $r$ such that $A$ contains at least one minor of order $r$ which is not equal to zero. If all minors of $A$ are zero, $A$ is a zero matrix, and we put $r = 0$. When $A = A_{mn}$, the rank $r$ is at most equal to the smaller of the numbers $m$ and $n$.

Let the rows and columns of $A$ be considered as vectors. If $A$ is of rank $r$, it is possible to find $r$ linearly independent rows of $A$, while any $r + 1$ rows are linearly dependent. The same holds true for columns.

If $A_1, A_2, \ldots, A_p$ are of ranks $r_1, r_2, \ldots, r_p$, the rank of the sum $A_1 + \cdots + A_p$ is at most equal to the sum $r_1 + \cdots + r_p$, while the rank of the product $A_1 A_2 \cdots A_p$ is at most equal to the smallest of the ranks $r_1, \ldots, r_p$.

If a square matrix $A = A_{nn}$ is such that $A \ne 0$, then $A$ is of rank $n$. Such a matrix is said to be non-singular, while a square matrix with $A = 0$ is of rank $r < n$ and is called a singular matrix. If an arbitrary matrix $B$ is multiplied (pre- or post-) by a non-singular matrix $A$, the product has the same rank as $B$. When the matrix of a linear transformation is singular or non-singular, the corresponding adjectives are also applied to the transformation.
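Two of the statements above lend themselves to a quick numerical check (an addition, not part of the original text): the transformation rule $B = C'AC$ of 11.4, and the rank inequalities of 11.6. The sketch uses Python with numpy; the matrices are random or hand-picked examples, not anything from the book.

```python
import numpy as np

# First, under x = Cy the quadratic form x'Ax becomes y'By with
# B = C'AC, which is again symmetric (11.4).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = (A + A.T) / 2   # symmetric matrix of Q
C = rng.standard_normal((3, 2))                      # n = 3 variables -> m = 2
B = C.T @ A @ C
y = rng.standard_normal(2)
assert np.allclose((C @ y) @ A @ (C @ y), y @ B @ y)  # same value of the form
assert np.allclose(B, B.T)                            # B = C'AC is symmetric

# Second, the rank of a sum is at most the sum of the ranks, and the
# rank of a product is at most the smallest of the ranks (11.6).
r = np.linalg.matrix_rank
A1 = np.outer([1., 0., 0.], [1., 2., 3.])             # rank 1
A2 = np.outer([0., 1., 0.], [4., 5., 6.])             # rank 1
assert r(A1 + A2) <= r(A1) + r(A2)                    # here 2 <= 1 + 1
assert r(A1 @ A2) <= min(r(A1), r(A2))                # product rank at most 1
```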
If $A$ is symmetric and of rank $r$, there is at least one principal minor of order $r$ in $A$ which is not zero. Hence in particular the rank of a diagonal matrix is equal to the number of diagonal elements which are different from zero.

The rank of a quadratic form $Q = x'Ax = \sum a_{ik}x_i x_k$ is, by definition, equal to the rank of the matrix $A$ of the form. According as $A$ is singular or non-singular, the same expressions are used with respect to $Q$. A non-singular linear transformation does not affect the rank of the form. If, by such a transformation, $Q$ is changed into $\sum_{i=1}^{r} \kappa_i y_i^2$, where $\kappa_i \ne 0$ for $i = 1, 2, \ldots, r$, it follows that $Q$ is of rank $r$. The rank is the smallest number of independent variables, on which $Q$ may be brought by a non-singular linear transformation.

A proposition which is often useful is the following: If $Q$ may be written in the form $Q = L_1^2 + \cdots + L_p^2$, where the $L_i$ are linear functions of $x_1, \ldots, x_n$, and if there are exactly $h$ independent linear relations (cf. 11.2) between the $L_i$, then the rank of $Q$ is $p - h$. It follows that, if we know that there are at least $h$ such linear relations, the rank of $Q$ is $\le p - h$.

11.7. Adjugate and reciprocal matrices. — Let $A = \{a_{ik}\}$ be a square matrix, and let as before $A_{ik}$ denote the cofactor of the element $a_{ik}$. If we form a matrix $\{A_{ik}\}$ with the cofactors as elements, and then transpose, we obtain a new matrix $A^* = \{a^*_{ik}\}$, where $a^*_{ik} = A_{ki}$. We shall call $A^*$ the adjugate of $A$. By the identities (11.5.1) and (11.5.2) we find

$$AA^* = A^*A = A \cdot I = \begin{pmatrix} A & 0 & \cdots & 0 \\ 0 & A & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & A \end{pmatrix}. \tag{11.7.1}$$

For the cofactor of the element $a^*_{ik} = A_{ki}$ in $A^*$ we have

$$A^*_{ik} = A^{n-2}\,a_{ki}. \tag{11.7.2}$$

This is only a particular case of a general relation which expresses any minor of $A^*$ in terms of $A$ and its minors. We shall here only quote the further particular case

$$\begin{vmatrix} A_{ii} & A_{ik} \\ A_{ki} & A_{kk} \end{vmatrix} = A \cdot A_{ii,kk}, \tag{11.7.3}$$

where $A_{ii,kk}$ denotes the cofactor of $a_{kk}$ in the minor $A_{ii}$.

When $A$ is non-singular, the matrix $A^{-1} = \dfrac{A^*}{A}$ is called the reciprocal of $A$. We obtain from (11.7.1)

$$AA^{-1} = A^{-1}A = I. \tag{11.7.4}$$
The matrix equations $AX = I$ and $XA = I$ then both have a unique solution, viz. $X = A^{-1}$. It follows that the determinant of $A^{-1}$ is $A^{-1}$. Further $(A^{-1})^{-1} = A$, so that the relation of reciprocity is mutual. The transpose of a reciprocal is equal to the reciprocal of the transpose: $(A^{-1})' = (A')^{-1}$. For the reciprocal of a product we have the rule $(AB)^{-1} = B^{-1}A^{-1}$.

When $A$ is symmetric, we have $A_{ki} = A_{ik}$, so that the adjugate $A^*$ and the reciprocal $A^{-1}$ are also symmetric. The reciprocal of a diagonal matrix $D$ with the diagonal elements $d_1, \ldots, d_n$ is another diagonal matrix with the diagonal elements $\dfrac{1}{d_1}, \ldots, \dfrac{1}{d_n}$.

If $Q = x'Ax$ is a non-singular quadratic form, the form $Q^{-1} = x'A^{-1}x$ is called the reciprocal form of $Q$. Obviously $(Q^{-1})^{-1} = Q$.

Let $x = (x_1, \ldots, x_n)$ and $t = (t_1, \ldots, t_n)$ be variable column vectors. If new variables $y = (y_1, \ldots, y_m)$ and $u = (u_1, \ldots, u_m)$ are introduced by the transformations

$$y = Cx, \qquad t = C'u, \tag{11.7.5}$$

where $C = C_{mn}$, we have

$$t'x = u'Cx = u'y. \tag{11.7.6}$$

The bilinear form $t'x = t_1 x_1 + \cdots + t_n x_n$ is thus transformed into the analogous form $u'y = u_1 y_1 + \cdots + u_m y_m$ in the new variables. Two sets of variables $x_i$ and $t_i$ which are transformed according to (11.7.5) are called contragredient sets of variables. In the particular case when $m = n$ and $C$ is non-singular, (11.7.5) may be written

$$y = Cx, \qquad u = (C')^{-1}t. \tag{11.7.7}$$

11.8. Linear equations. — We shall here only consider some particular cases. The non-homogeneous system
Thus Xk is expressed by a fraction with the denominator A and the numerator equal to the determinant obtained from A when the ele- ments of the A;:th column are replaced by the second members , , ,,hn. This is the classical solution due to Cramer (1750). Consider now the homogeneous system (11.8.3) Xx + a, 8 a:, 4 4 am a:n == 0, O/m 1 a?! 4 ant 2 a ?2 -i- ' ■ • 4 Umn Xn — 0, or in matrix notation Ax = 0, where m is not necessarily equal to n. By 11.6, the matrix A is of rank r^n. If r = w, the system (11.8.3) has only the trivial solution * = 0. On the other hand, if r < n, it is possible to find n — r linearly independent vectors Cj, . . ., Cn^r such that the general solution of (11.8.3) may be written in the form x = t^c^ 4 - -f tn-rCn-r, where the U are arbitrary constants. 11.9. Orthogonal matrices. Characteristic numbers. — An ortho- gonal matrix is a square matrix C= {c,jt} such that CC' —I, Hence C* = 1, so that the determinant C7 = | C| = ± 1. Obviously the trans- pose C' of an orthogonal C is itself orthogonal. Further C~^ = C', and thus by the definition of the reciprocal matrix Cik = Ccik for all t and A, and hence by the identities (11.5.1) and (11.5.2) (11.9.1) (11.9.2) 2 n 2 i=i fi Ctj j ^ / 1 = j 0 112 for i = k, for i 7 ^ k, for i = k, for i 7 ^ k. 11.9 The product C, C, of two orthogonal matrices of the same order is itself orthogonal. — If any number p< n of rows cnyCi^ c^n (/ = 1,2, . . . , p) are given, such that the relations (11.9.1) are satished, we can always find n—p further rows such that the resulting matrix of order n-n is orthogonal. The same holds, of course, for columns. The linear transformation x — Cy, where C is orthogonal, is called an orthogonal transformation. The quadratic form xx ~ x] • + xl is invariant t under this transformation, i. e. it is transformed into the form y'C Cy = y’y — //J + ■ + , which has the same matrix /. — The reciprocal transformation y — C is also orthogonal, since C~^ = C' is orthogonal. 
The orthogonal transformations have an important geometrical significance. In fact, any orthogonal transformation may be regarded as the analytical expression of the transformation of coordinates in an euclidean space of n dimensions which is eflPected by a rotation of a rectangular system of coordinate axes about a fixed origin. The distance (.tJ + • ■■ -f from the origin to the point (x, , . . ir„) is invariant under any such rotation. If A is an arbitrary symmetric matrix, it is always possible to find an orthogonal matrix C such that the product C'AC is a dia- gonal matrix: X, 0 . . 0 (11.9.3) CAC^K= 0 Xg . . 0 00 . Any other orthogonal matrix satisfying the same condition yields the same diagonal elements , . . . , though possibly in another arrange- ment. The numbers . . ., x„, which thus depend only on the matrix A, are called the characteristic numbers of A. They are the n roots of the secular equation All X . . . Ain (11.9.4) Agi A 22 X . . . A2n Un 1 an 2 • • • Ann X and are all real. Since C is non-singular, A and K have the same rank (cf ll.b). Hence the rank of A is equal to the number of the roots X, which are not zero. From (11.9.3) we obtain, taking the determinants on both sides and paying regard to the relation = I , 113 8 — 464 H. CranUr 11 . 9-10 (11.9.5) j 4 = X, . . . x„. If A is non-singular, the identity (11.9.6) \A~^-yiI\ = {-^YA-^\A-U\ shows that the characteristic numbers of A~^ are the reciprocals of the characteristic numbers of A, Finally, let be a matrix of order m * n, where If JB is of rank m, the symmetric matrix BB* of order m • m has all its charac- teristic numbers positive. It follows, in particular, that BB' is non- singular. — This is proved without difficulty if, in (11.9.3), we take A = BB' and express an arbitrary characteristic number Hi by means of the multiplication rule. 11.10. Non-negative quadratic forms. — If, for all real values of the variables , . . . , , we have Q[x^, . . 
.,Xn) ='^aikXiXk S 0, i, A:=l where Uki — a,A:, the form Q will be called a non-negative quadratic form. If, in addition, the sign of equality in the last relation holds only when all the Xi are equal to zero, we shall say that Q is definite positive. A form Q which is non-negative without being definite positive, will be called semi-definite positive. Bach of the properties of being non-negative, definite positive or semi-definite positive, is obviously invariant under any non-singular linear transformation. The symmetric matrix A ^ {at k] will be called non-negative, definite positive or semi-definite positive, according as the corresponding quadratic form Q — xAx has these properties. The orthogonal transformation x = Cy, where C is the orthogonal matrix occurring in the special transformation (11.9.3), changes the form Q into a form containing only quadratic terms: (ll.lO.l) Q (^Tj , . . . , Xn) ~ ^^1 "t" + • H- Xn , or in matrix notation xAx =y'Ky, where the x, are the characteristic numbers of A, while K is the corresponding diagonal matrix occurring in (11.9.3). By the same orthogonal transformation, the form Q —■ X (x^ -I- ■ • + Xn) is transformed into (xj — x) y J + • • + (xn — x) yi . If x ^ 114 11.10 the smallest characteristic number of the last form is obviously non-negative, and it follows that the form Q — x(x] + • 4- Xn), with the matri A — xl, has the same property. If the form Q is definite positive, the form in the second member of (11.10.1) has the same property, and it follows that in this case all the characteristic numbers Xi are positive. Hence by (11.9.5) we have > 0, so that A is non-singular. If, on the other hand, Q is semi-definite positive, the same argu- ment shows that at least one of the characteristic numbers is zero, so that ^ = 0. If Q is of rank r, there are exactly r positive charac- teristic numbers, while the m — r others are equal to zero. In this case, there are exactly w— r linearly independent vectors Xp = (x[p\ . . . 
, such that Q (xp) = 0. The geometrical significance of the orthogonal transformation considered above is that, by a suitable rotation of the coordinate system, the quadric ©(xj, . . oJn) = const, is referred to its principal axes. If Q is definite positive, the equation Q = const, represents an ellipsoid in n dimensions, with the semi-axes zr ^ . For semi-definite forms (>, we obtain various classes of elliptic cylinders. If Q is definite positive, any form obtained by putting one or more of the Xi equal to zero must be definite positive. Hence any principal minor of Q is positive. For a semi-definite positive the same argument shows that any principal minor is non-negative. — It follows in particular that if, in a non-negative form Q, the quadratic term xf does not occur, then Q must be wholly independent of Xi. Otherwise, in fact, the principal minor aaak k — a'k would be negative for some h, — Conversely, if the quantities An. 22 , • ■ ■ , ^ 11. 22 . . «-i, n-i are all positive, Q is definite positive. The substitution x A~^ y changes the form Q — xAx into the reciprocal form Qr^ —y'A-^y. Thus if Q is definite positive, so is and conversely. This can also be seen directly from (11.9.6). — Consider now the relation (11.5.3) for a definite positive symmetric matrix A. Since any principal submatrix of A is also definite positive, it follows that the last term in the second member of (11.5.3) is a definite positive quadratic form in the variables , . . . , , so that we have 0 < A ^ and generally (11.10.2) Q<A^ an An [i = 1, 2, . . ., w). By repeated application of the same argument we obtain 115 11 . 10-11 (11.10.3) 0 < A ^ ^11 The sign of equality holds here only when A is a diagonal matrix. — For a general non-negative matrix, the relation (11.10.3) holds, of course, if we replace the sign < by n 11.11 . Decomposition of 2 certain statistical applications 1 we are concerned with various relations of the type ( 11 . 11 . 1 ) = where Qi is for i — 1 , 2, . . . 
The sign of equality holds here only when $A$ is a diagonal matrix. For a general non-negative matrix, the relation (11.10.3) holds, of course, if we replace the sign $<$ by $\le$.

11.11. Decomposition of $\sum x_i^2$. — In certain statistical applications we are concerned with various relations of the type

$$\sum_{i=1}^{n} x_i^2 = Q_1 + Q_2 + \cdots + Q_k, \tag{11.11.1}$$

where $Q_i$ is, for $i = 1, 2, \ldots, k$, a non-negative quadratic form in $x_1, \ldots, x_n$ of rank $r_i$.
Here, the variables $z_1, \ldots, z_{r_1}$ do not occur in the first member, and we shall now show that these variables do not occur in any term in the second member. If, e.g., $Q_2'$ were not independent of $z_1$, then by the preceding paragraph $Q_2'$ must contain a term $c z_1^2$ with $c > 0$. Since the coefficients of $z_1^2$ in $Q_3', \ldots, Q_k'$ are certainly non-negative, this would, however, imply a contradiction with (11.11.3).

Thus (11.11.3) gives a representation of $\sum_{r_1+1}^{n} z_i^2$ as a sum of $k - 1$ non-negative forms in $z_{r_1+1}, \ldots, z_n$. By hypothesis the Cochran theorem holds for this decomposition. Thus there exists an orthogonal transformation in $n - r_1$ variables, replacing $z_{r_1+1}, \ldots, z_n$ by new variables $y_{r_1+1}, \ldots, y_n$, such that

$$Q_2' = \sum_{r_1+1}^{r_1+r_2} y_i^2, \quad \ldots, \quad Q_k' = \sum_{n-r_k+1}^{n} y_i^2. \tag{11.11.4}$$

If we complete this transformation by the equations $z_1 = y_1, \ldots, z_{r_1} = y_{r_1}$, we obtain an orthogonal transformation in $n$ variables, $z = C_2 y$, such that (11.11.4) holds. The result of performing successively the transformations $x = C_1 z$ and $z = C_2 y$ will be a composed transformation $x = C_1 C_2 y$ which is orthogonal, since the product of two orthogonal matrices is itself orthogonal. This transformation has all the required properties, and thus the theorem is proved.

Let us remark that if, in (11.11.1), we only know that every $Q_i$ is non-negative and that the rank of $Q_i$ is at most equal to $r_i$, where $\sum_1^k r_i = n$, we can at once infer that $Q_i$ is effectively of rank $r_i$, so that the conditions of the Cochran theorem are satisfied. In fact, since the rank of a sum of quadratic forms is at most equal to the sum of the ranks, we have, denoting by $r_i'$ the rank of $Q_i$,

$$n \le \sum_1^k r_i' \le \sum_1^k r_i = n.$$

Thus $\sum r_i' = \sum r_i$. This evidently implies $r_i' = r_i$ for all $i$.

We finally remark that the Cochran theorem evidently holds true if, in (11.11.1), the first member is replaced by a quadratic form $Q$ in any number of variables which, by an orthogonal transformation, may be transformed into $\sum_1^n y_i^2$.

11.12. Some integral formulae.
— We shall first prove the important formula

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{i t' x - \frac{1}{2} Q(x)}\, dx_1 \cdots dx_n = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2} Q^{-1}(t)}, \tag{11.12.1a}$$

or in ordinary notation

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\Big( i \sum_j t_j x_j - \tfrac{1}{2} \sum_{j,k} a_{jk} x_j x_k \Big)\, dx_1 \cdots dx_n = \frac{(2\pi)^{n/2}}{\sqrt{A}} \exp\Big( -\tfrac{1}{2} \sum_{j,k} \frac{A_{jk}}{A}\, t_j t_k \Big), \tag{11.12.1b}$$

where $Q$ is a definite positive quadratic form of matrix $A$, while $t = (t_1, \ldots, t_n)$ is a real vector. As in the preceding paragraphs, $A$ is the determinant $|A|$, while $Q^{-1}$ is the reciprocal form defined in 11.7. — For $n = 1$, the formula reduces to (10.5.2).

In order to prove (11.12.1a) we introduce new variables $y = (y_1, \ldots, y_n)$ by the substitution $x = Cy$, where $C$ is the orthogonal matrix of (11.9.3), so that $C'AC = K$, where $K$ is the diagonal matrix formed by the characteristic numbers $\kappa_j$ of $A$. At the same time we replace the vector $t$ by a new vector $u = (u_1, \ldots, u_n)$ by means of the contragredient substitution (cf. 11.7.7) $t = (C')^{-1} u$, which in this case reduces to $t = Cu$, since $C$ is orthogonal. By (11.7.6) we then have $t'x = u'y$. Denoting the integral in the first member of (11.12.1a) by $J$, we then obtain, since $|C| = \pm 1$,

$$J = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{i u' y - \frac{1}{2} \sum_j \kappa_j y_j^2}\, dy_1 \cdots dy_n = \prod_{j=1}^{n} \int_{-\infty}^{\infty} e^{i u_j y_j - \frac{1}{2} \kappa_j y_j^2}\, dy_j.$$

Applying (10.5.2) to every factor of the last expression, we obtain

$$J = \prod_{j=1}^{n} \sqrt{\frac{2\pi}{\kappa_j}}\, e^{-\frac{u_j^2}{2 \kappa_j}} = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2} u' K^{-1} u},$$

since by 11.7 the diagonal matrix with the diagonal elements $1/\kappa_j$ is identical with the reciprocal $K^{-1}$, while by (11.9.5) we have $A = \kappa_1 \kappa_2 \cdots \kappa_n$. We have, however, $K^{-1} = (C'AC)^{-1} = C^{-1} A^{-1} (C')^{-1} = C' A^{-1} C$, since $C$ is orthogonal. Hence $u' K^{-1} u = u' C' A^{-1} C u = t' A^{-1} t$, and thus finally

$$J = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2} t' A^{-1} t},$$

i.e. the formula (11.12.1a). — Putting in particular $t = 0$, we obtain the formula

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2} Q(x)}\, dx_1 \cdots dx_n = \frac{(2\pi)^{n/2}}{\sqrt{A}}. \tag{11.12.2}$$

This holds even for a matrix $A$ with complex elements, provided that the matrix formed by the real parts of the elements is definite positive.

We further consider the integral

$$V = \underset{Q \le c^2}{\int \cdots \int} dx_1 \cdots dx_n,$$

which represents the $n$-dimensional »volume» of the domain bounded by the ellipsoid $Q = c^2$.
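Formula (11.12.2) is easy to verify numerically in a low-dimensional case. The Python sketch below (my own illustration; the positive definite matrix is chosen arbitrarily and does not come from the text) integrates $e^{-Q/2}$ for $n = 2$ by the midpoint rule and compares the result with $(2\pi)^{n/2}/\sqrt{A} = 2\pi/\sqrt{5}$:

```python
import math

# Check of (11.12.2) for n = 2 with the (arbitrary) positive definite matrix
#   A = [[2, 1], [1, 3]],  det A = 5.
a11, a12, a22 = 2.0, 1.0, 3.0
det_A = a11 * a22 - a12 * a12            # = 5

def integrand(x, y):
    q = a11 * x * x + 2.0 * a12 * x * y + a22 * y * y
    return math.exp(-0.5 * q)

# Midpoint rule on [-8, 8]^2; the integrand is negligible outside this box.
m, L = 400, 8.0
h = 2.0 * L / m
total = 0.0
for i in range(m):
    x = -L + (i + 0.5) * h
    for j in range(m):
        y = -L + (j + 0.5) * h
        total += integrand(x, y)
total *= h * h

exact = (2.0 * math.pi) / math.sqrt(det_A)
print(total, exact)                      # both close to 2.8099
assert abs(total - exact) < 1e-6
```

The midpoint rule converges very rapidly here because the integrand is smooth and decays like a Gaussian, so even this coarse grid reproduces the closed form to high accuracy.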
The orthogonal transformation used above, followed by the simple substitution $y_j = \frac{c}{\sqrt{\kappa_j}} z_j$, shows that we have

$$V = \frac{c^n}{\sqrt{A}} \underset{\sum z_j^2 \le 1}{\int \cdots \int} dz_1 \cdots dz_n.$$

The last integral represents the volume of the $n$-dimensional »unit sphere», and it will be shown below that its value is $\frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)}$, so that

$$V = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)} \cdot \frac{c^n}{\sqrt{A}}. \tag{11.12.3}$$

We shall finally require the value of the integral

$$B_{ik} = \underset{Q \le c^2}{\int \cdots \int} x_i x_k\, dx_1 \cdots dx_n,$$

extended over the same domain as the integral $V$. Making the same substitutions as in the case of $V$, we find by some calculation that the matrix $B$ with the elements $B_{ik}$ is

$$B = g_n\, C K^{-1} C' = g_n A^{-1}, \qquad \text{where} \qquad g_n = \frac{c^{n+2}}{\sqrt{A}} \underset{\sum z_j^2 \le 1}{\int \cdots \int} z_1^2\, dz_1 \cdots dz_n.$$

It will be shown below that the last integral has the value $\frac{\pi^{n/2}}{2\, \Gamma\left(\frac{n}{2}+2\right)}$, so that

$$B_{ik} = \frac{c^2}{n+2} \cdot V \cdot \frac{A_{ik}}{A}. \tag{11.12.4}$$

The Dirichlet integrals used above,

$$J_1 = \int \cdots \int dz_1 \cdots dz_n \qquad \text{and} \qquad J_2 = \int \cdots \int z_1^2\, dz_1 \cdots dz_n,$$

extended over the $n$-dimensional unit sphere $\sum_1^n z_j^2 \le 1$, can be calculated by means of the transformation

$$z_1 = \cos \varphi_1, \quad z_2 = \sin \varphi_1 \cos \varphi_2, \quad z_3 = \sin \varphi_1 \sin \varphi_2 \cos \varphi_3, \quad \ldots, \quad z_n = \sin \varphi_1 \cdots \sin \varphi_{n-1} \cos \varphi_n,$$

which establishes a one-to-one correspondence between the domains $\sum_1^n z_j^2 < 1$ and $0 < \varphi_i < \pi$ $(i = 1, 2, \ldots, n)$. The Jacobian of the transformation is $(-1)^n (\sin \varphi_1)^n (\sin \varphi_2)^{n-1} \cdots \sin \varphi_n$. With the aid of the relation

$$\int_0^{\pi} (\sin \varphi)^m\, d\varphi = 2 \int_0^{\pi/2} (\sin \varphi)^m\, d\varphi,$$

which is proved by substituting $x = \sin^2 \varphi$ and using (12.4.2), we then obtain

$$J_1 = \int_0^{\pi} (\sin \varphi_1)^n\, d\varphi_1 \int_0^{\pi} (\sin \varphi_2)^{n-1}\, d\varphi_2 \cdots \int_0^{\pi} \sin \varphi_n\, d\varphi_n = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)},$$

$$J_2 = \int_0^{\pi} (\sin \varphi_1)^n \cos^2 \varphi_1\, d\varphi_1 \int_0^{\pi} (\sin \varphi_2)^{n-1}\, d\varphi_2 \cdots \int_0^{\pi} \sin \varphi_n\, d\varphi_n = \frac{\pi^{n/2}}{2\, \Gamma\left(\frac{n}{2}+2\right)}.$$

CHAPTER 12.

Miscellaneous Complements.

12.1. The symbols $O$, $o$ and $\sim$. — When we are investigating the behaviour of a function $f(x)$ as $x$ tends to zero, or infinity, or some other specified limit, it is often desirable to compare the order of magnitude of $f(x)$ with the order of magnitude of some known simple function $g(x)$. In such situations we shall often use the following notations.

1) When $f(x)/g(x)$ remains bounded as $x$ tends to its limit, we write $f(x) = O(g(x))$, which may be read: »$f(x)$ is at most of the order $g(x)$».
2) When $f(x)/g(x)$ tends to zero, we write $f(x) = o(g(x))$, which may be read: »$f(x)$ is of a smaller order than $g(x)$».

3) When $f(x)/g(x)$ tends to unity, we write $f(x) \sim g(x)$, which may be read: »$f(x)$ is asymptotically equal to $g(x)$».

Thus as $x \to \infty$ we have e.g.

$$ax + b = O(x), \qquad x^n = o(e^x), \qquad x + \log x \sim x.$$

Symbols like $O(x)$, $o(1)$ etc. will often be used without reference to a specified function $f(x)$. Thus e.g. $O(x)$ will stand for »any function which is at most of order $x$», while $O(1)$ signifies »any bounded function», and $o(1)$ »any function tending to zero».

As a further example we consider a function $f(x)$ which, in some neighbourhood of $x = 0$, has $n$ continuous derivatives. We then have the MacLaurin expansion

$$f(x) = \sum_{\nu=0}^{n} \frac{f^{(\nu)}(0)}{\nu!}\, x^\nu + \frac{f^{(n)}(\theta x) - f^{(n)}(0)}{n!}\, x^n, \qquad (0 < \theta < 1).$$

Now by hypothesis $f^{(n)}(\theta x) - f^{(n)}(0)$ tends to zero with $x$. According to the above we may thus write, as $x$ tends to zero,

$$f(x) = \sum_{\nu=0}^{n} \frac{f^{(\nu)}(0)}{\nu!}\, x^\nu + o(x^n).$$

This relation, which holds even when $f(x)$ is complex, has already been used in (10.1.8).

12.2. The Euler-MacLaurin sum formula. — We define a sequence of auxiliary functions $P_1(x), P_2(x), \ldots$ by the trigonometric expansions

$$P_{2k}(x) = (-1)^{k-1} \sum_{\nu=1}^{\infty} \frac{2 \cos 2\nu\pi x}{(2\nu\pi)^{2k}}, \qquad P_{2k+1}(x) = (-1)^{k} \sum_{\nu=1}^{\infty} \frac{2 \sin 2\nu\pi x}{(2\nu\pi)^{2k+1}}. \tag{12.2.1}$$

All these functions are periodical with the period 1, so that $P_n(x+1) = P_n(x)$. For $n > 1$, the series representing $P_n(x)$ is absolutely and uniformly convergent for all real $x$, so that $P_n(x)$ is bounded and continuous over the whole interval $(-\infty, \infty)$. The series for $P_1(x)$, on the other hand, is only conditionally convergent, and it is well known that we have $P_1(x) = \frac{1}{2} - x$ for $0 < x < 1$. Denoting by $[x]$ the greatest integer $\le x$, it follows from the periodicity that we have for all non-integral values of $x$

$$P_1(x) = [x] - x + \tfrac{1}{2}.$$

Thus every integer is a discontinuity point for $P_1(x)$, and we have $|P_1(x)| \le \frac{1}{2}$ for all $x$. For integral values $m$ of $x$ we have

$$P_{2k}(m) = \frac{B_{2k}}{(2k)!}, \qquad P_{2k+1}(m) = 0.$$

The numbers $B_\nu$ appearing here are the Bernoulli numbers defined by the expansion
$$\frac{z}{e^z - 1} = \sum_{\nu=0}^{\infty} \frac{B_\nu}{\nu!}\, z^\nu. \tag{12.2.2}$$

We have

$$B_0 = 1, \quad B_1 = -\tfrac{1}{2}, \quad B_2 = \tfrac{1}{6}, \quad B_4 = -\tfrac{1}{30}, \quad B_6 = \tfrac{1}{42}, \ldots,$$

while all the $B_\nu$ of odd order $\ge 3$ are zero. — For $n > 1$ we have

$$\frac{d}{dx} P_n(x) = -P_{n-1}(x).$$

For $n > 2$ this relation holds for all $x$, while for $n = 2$ its validity is restricted to non-integral values of $x$.

Consider now a function $g(x)$ which is continuous and has a continuous derivative $g'(x)$ for all $x$ in the closed interval $(a + n_1 h, a + n_2 h)$, where $a$ and $h > 0$ are constants, while $n_1$ and $n_2$ are positive or negative integers with $n_1 < n_2$. For any integer $\nu$ such that $n_1 \le \nu < n_2$ we then find by partial integration

$$h \int_{\nu}^{\nu+1} P_1(x)\, g'(a + hx)\, dx = -\tfrac{1}{2}\, g(a + \nu h) - \tfrac{1}{2}\, g(a + (\nu+1) h) + \int_{\nu}^{\nu+1} g(a + hx)\, dx.$$

Hence we obtain, summing over $\nu = n_1, \ldots, n_2 - 1$,

$$\sum_{\nu=n_1}^{n_2} g(a + h\nu) = \int_{n_1}^{n_2} g(a + hx)\, dx + \tfrac{1}{2}\, g(a + n_1 h) + \tfrac{1}{2}\, g(a + n_2 h) - h \int_{n_1}^{n_2} P_1(x)\, g'(a + hx)\, dx. \tag{12.2.3}$$

This is the simplest case of the Euler-MacLaurin sum formula, which is often very useful for the summation of series. If $g(x)$ has continuous derivatives of higher orders, the last term can be transformed by repeated partial integration, and we obtain the general formula

$$\sum_{\nu=n_1}^{n_2} g(a + h\nu) = \int_{n_1}^{n_2} g(a + hx)\, dx + \tfrac{1}{2}\, g(a + n_1 h) + \tfrac{1}{2}\, g(a + n_2 h) + \sum_{\nu=1}^{s} \frac{B_{2\nu}}{(2\nu)!}\, h^{2\nu-1} \left[ g^{(2\nu-1)}(a + n_2 h) - g^{(2\nu-1)}(a + n_1 h) \right] - h^{2s+1} \int_{n_1}^{n_2} P_{2s+1}(x)\, g^{(2s+1)}(a + hx)\, dx, \tag{12.2.4}$$

where $s$ may be any non-negative integer, provided that all derivatives appearing in the formula exist and are continuous.

If $\sum_{-\infty}^{\infty} g(a + h\nu)$ and $\int_{-\infty}^{\infty} g(a + hx)\, dx$ both converge, we obtain from the formula (12.2.3)

$$\sum_{\nu=-\infty}^{\infty} g(a + h\nu) = \int_{-\infty}^{\infty} g(a + hx)\, dx - h \int_{-\infty}^{\infty} P_1(x)\, g'(a + hx)\, dx, \tag{12.2.5}$$

where the last integral must also converge. If, in addition, $g^{(\nu)}(x) \to 0$ as $x \to \pm\infty$ for $\nu = 1, 2, \ldots, s$, we obtain from (12.2.4)

$$\sum_{\nu=-\infty}^{\infty} g(a + h\nu) = \int_{-\infty}^{\infty} g(a + hx)\, dx - h^{2s+1} \int_{-\infty}^{\infty} P_{2s+1}(x)\, g^{(2s+1)}(a + hx)\, dx. \tag{12.2.6}$$

If, in (12.2.3), we take $g(x) = \frac{1}{x}$, $a = 0$, $h = 1$, $n_1 = 1$ and $n_2 = n$, we obtain

$$\sum_{\nu=1}^{n} \frac{1}{\nu} = \log n + \tfrac{1}{2} + \frac{1}{2n} + \int_{1}^{n} \frac{P_1(x)}{x^2}\, dx.$$

From the definition of $P_1(x)$, it is easily seen that

$$0 < \int_{n}^{\infty} \frac{P_1(x)}{x^2}\, dx < \frac{1}{8 n^2},$$

so that we have

$$\sum_{\nu=1}^{n} \frac{1}{\nu} = \log n + C + \frac{1}{2n} + O\left( \frac{1}{n^2} \right), \tag{12.2.7}$$

where

$$C = \tfrac{1}{2} + \int_{1}^{\infty} \frac{P_1(x)}{x^2}\, dx = 0.5772\ldots$$
is known as Euler's constant.

12.3. The Gamma function. — The Gamma function $\Gamma(p)$ is defined for all real $p > 0$ by the integral

$$\Gamma(p) = \int_0^{\infty} x^{p-1} e^{-x}\, dx. \tag{12.3.1}$$

By 7.3, the function is continuous and has continuous derivatives of all orders:

$$\Gamma^{(\nu)}(p) = \int_0^{\infty} x^{p-1} (\log x)^\nu\, e^{-x}\, dx,$$

for any $p > 0$. When $p$ tends to 0 or to $+\infty$, $\Gamma(p)$ tends to $+\infty$. Since the second derivative is always positive, $\Gamma(p)$ has one single minimum in $(0, \infty)$. Approximate calculation shows that the minimum is situated in the point $p_0 = 1.4616$, where the function assumes the value $\Gamma(p_0) = 0.8856$.

By a partial integration, we obtain from (12.3.1) for any $p > 0$

$$\Gamma(p + 1) = p\, \Gamma(p).$$

When $p$ is equal to a positive integer $n$, a repeated use of the last equality gives, since $\Gamma(1) = 1$,

$$\Gamma(n + 1) = n!$$

From (12.3.1) we further obtain the relation

$$\int_0^{\infty} x^{\lambda - 1} e^{-\alpha x}\, dx = \frac{\Gamma(\lambda)}{\alpha^\lambda}, \tag{12.3.2}$$

where $\alpha > 0$, $\lambda > 0$. If we replace here $\alpha$ by $\alpha - it$ and develop the factor $e^{itx}$ in series, it can be shown that the last relation holds true for complex values of $\alpha$, provided that the real part of $\alpha$ is positive.¹)

By (12.3.2), the function

$$f(x; \alpha, \lambda) = \begin{cases} \dfrac{\alpha^\lambda}{\Gamma(\lambda)}\, x^{\lambda - 1} e^{-\alpha x} & \text{for } x > 0, \\ 0 & \text{for } x \le 0, \end{cases} \tag{12.3.3}$$

has, with respect to the variable $x$, the fundamental properties of a frequency function (cf. 6.6): the function is always non-negative, and its integral over $(-\infty, \infty)$ is equal to 1. The corresponding distribution plays an important role in the applications (cf. e.g. 18.1 and 19.4). It has the characteristic function

$$\int_{-\infty}^{\infty} e^{itx} f(x; \alpha, \lambda)\, dx = \frac{\alpha^\lambda}{\Gamma(\lambda)} \int_0^{\infty} x^{\lambda - 1} e^{-(\alpha - it)x}\, dx = \left( 1 - \frac{it}{\alpha} \right)^{-\lambda}. \tag{12.3.4}$$

12.4. The Beta function. — The Beta function $B(p, q)$ is defined for all real $p > 0$, $q > 0$ by the integral

¹) A reader acquainted with Cauchy's theorem on complex integration will be able to deduce the validity of (12.3.2) for complex $\alpha$ by a simple application of that theorem.
$$B(p, q) = \int_0^1 x^{p-1} (1 - x)^{q-1}\, dx. \tag{12.4.1}$$

We shall prove the important relation

$$B(p, q) = \frac{\Gamma(p)\, \Gamma(q)}{\Gamma(p + q)}. \tag{12.4.2}$$

By (12.3.2) we have, for any fixed $t > 0$, $\Gamma(p) = t^p \int_0^{\infty} x^{p-1} e^{-tx}\, dx$, and hence

$$\Gamma(p)\, t^{q-1} e^{-t} = \int_0^{\infty} t^{p+q-1} x^{p-1} e^{-t(1+x)}\, dx.$$

This integral, regarded as a function of the parameter $t$, satisfies the conditions of the integration theorem of 7.3 for any interval $(\varepsilon, \infty)$ with $\varepsilon > 0$, so that we have

$$\Gamma(p) \int_{\varepsilon}^{\infty} t^{q-1} e^{-t}\, dt = \int_0^{\infty} x^{p-1}\, dx \int_{\varepsilon}^{\infty} t^{p+q-1} e^{-t(1+x)}\, dt.$$

When $\varepsilon$ tends to zero, the first member tends to $\Gamma(p)\Gamma(q)$. In the second member, the integral with respect to $t$ tends increasingly to the limit $\Gamma(p+q)\, \dfrac{x^{p-1}}{(1+x)^{p+q}}$, which is integrable with respect to $x$ over $(0, \infty)$. According to (5.5.2) we then obtain

$$\Gamma(p)\, \Gamma(q) = \Gamma(p+q) \int_0^{\infty} \frac{x^{p-1}}{(1+x)^{p+q}}\, dx.$$

Introducing the new variable $y = \dfrac{x}{1+x}$ in the integral, we obtain the relation (12.4.2).

Taking in particular $p = q$ in (12.4.2) we obtain, introducing the new variable $y = 2x - 1$,

$$\frac{\Gamma^2(p)}{\Gamma(2p)} = \int_0^1 \left( x(1-x) \right)^{p-1} dx = 2^{2-2p} \int_0^1 (1 - y^2)^{p-1}\, dy. \tag{12.4.3}$$

For $p = \frac{1}{2}$ this gives

$$\Gamma\left( \tfrac{1}{2} \right) = \sqrt{\pi}.$$

On the other hand, putting $y^2 = x$ in (12.4.3), we obtain

$$\frac{\Gamma^2(p)}{\Gamma(2p)} = 2^{1-2p}\, \frac{\Gamma(p)\, \Gamma\left( \frac{1}{2} \right)}{\Gamma\left( p + \frac{1}{2} \right)}, \qquad \text{or} \qquad \Gamma(2p) = \frac{2^{2p-1}}{\sqrt{\pi}}\, \Gamma(p)\, \Gamma\left( p + \tfrac{1}{2} \right). \tag{12.4.4}$$

If we define a function $\beta(x; p, q)$ by the relation

$$\beta(x; p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\, \Gamma(q)}\, x^{p-1} (1 - x)^{q-1} \qquad \text{for } 0 < x < 1, \tag{12.4.5}$$

and put $\beta(x; p, q) = 0$ outside that interval, it follows from (12.4.1) and (12.4.2) that this function has the fundamental properties of a frequency function. The corresponding distribution, which has its total mass confined to the interval $(0, 1)$, will be further discussed in 18.4.

12.5. Stirling's formula. — We now proceed to deduce a famous formula due to Stirling, which gives an asymptotic expression for $\Gamma(p)$ when $p$ is large. We shall first prove the relation

$$\Gamma(p) = \lim_{n \to \infty} \frac{n^p\, n!}{p (p+1) \cdots (p+n)} \tag{12.5.1}$$

for any $p > 0$. By repeated partial integration we obtain

$$\int_0^n x^{p-1} \left( 1 - \frac{x}{n} \right)^n dx = \frac{n^p\, n!}{p (p+1) \cdots (p+n)}.$$

The first member of this relation may be written as $\int_0^{\infty} g(x, n)\, dx$, where $g(x, n) = x^{p-1} \left( 1 - \frac{x}{n} \right)^n$ for $0 < x < n$, and $g(x, n) = 0$ for $x \ge n$. As $n$ tends to infinity, $g(x, n)$ tends to $x^{p-1} e^{-x}$ for every $x > 0$, and it is easily seen that we always have $0 \le g(x, n) < x^{p-1} e^{-x}$. Hence by (5.5.2) we obtain (12.5.1).
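Both the relation (12.4.2) and the limit (12.5.1) are easy to check numerically. The Python sketch below uses the standard library functions `math.gamma` and `math.lgamma`; the test values $p = 2.5$, $q = 1.5$ are my own choice, and the product in (12.5.1) is evaluated in logarithms to avoid overflow for large $n$:

```python
import math

# Check of the Beta relation (12.4.2) by direct quadrature of (12.4.1).
def beta_integral(p, q, m=200000):
    # midpoint rule for B(p,q) = int_0^1 x^(p-1) (1-x)^(q-1) dx
    h = 1.0 / m
    return h * sum(((i + 0.5) * h) ** (p - 1) * (1.0 - (i + 0.5) * h) ** (q - 1)
                   for i in range(m))

p, q = 2.5, 1.5
lhs = beta_integral(p, q)
rhs = math.gamma(p) * math.gamma(q) / math.gamma(p + q)
assert abs(lhs - rhs) < 1e-4

# Check of the limit (12.5.1): n^p n! / (p(p+1)...(p+n)) -> Gamma(p).
def gauss_product(p, n):
    # computed in log space, since n! overflows a float for large n
    log_val = p * math.log(n) + math.lgamma(n + 1)
    for k in range(n + 1):
        log_val -= math.log(p + k)
    return math.exp(log_val)

for n in (10, 100, 1000):
    print(n, gauss_product(p, n))        # approaches Gamma(2.5) = 1.3293...
assert abs(gauss_product(p, 1000) - math.gamma(p)) < 1e-2
```

The slow approach to the limit in the second check (the error decreases only like $1/n$) is exactly what motivates the sharper asymptotic analysis of Stirling's formula that follows.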
It follows from (12.5.1) that $\log \Gamma(p) = \lim S_n$, where

$$S_n = p \log n + \sum_{\nu=1}^{n} \log \nu - \sum_{\nu=0}^{n} \log(p + \nu).$$

Applying the Euler-MacLaurin formula (12.2.3) to both sums in the last expression, we obtain after some reductions

$$S_n = \left( p - \tfrac{1}{2} \right) \log p - \left( p + n + \tfrac{1}{2} \right) \log\left( 1 + \frac{p}{n} \right) + 1 - \int_1^n \frac{P_1(x)}{x}\, dx + \int_0^n \frac{P_1(x)}{p + x}\, dx.$$

As $n$ tends to infinity, the second term on the right-hand side tends to $-p$, while the two integrals are convergent (though not absolutely), owing to the fluctuations of sign of $P_1(x)$. Thus we obtain

$$\log \Gamma(p) = \left( p - \tfrac{1}{2} \right) \log p - p + k + R(p), \tag{12.5.2}$$

where $k$ is a constant, and the remainder term $R(p)$ has the expression

$$R(p) = \int_0^{\infty} \frac{P_1(x)}{p + x}\, dx.$$

This integral may be transformed by repeated partial integration, as shown in (12.2.4), and we obtain in this way

$$R(p) = \frac{B_2}{1 \cdot 2\, p} + \frac{B_4}{3 \cdot 4\, p^3} + \cdots + \frac{B_{2s}}{(2s-1)\, 2s\, p^{2s-1}} + (2s)! \int_0^{\infty} \frac{P_{2s+1}(x)}{(p + x)^{2s+1}}\, dx$$

for $s = 0, 1, 2, \ldots$ For any $s \ge 1$, the integral appearing here is absolutely convergent, and its modulus is smaller than $\dfrac{A}{p^{2s}}$, where $A$ is a constant. It follows in particular that $R(p) \to 0$ as $p \to \infty$.

In order to find the value of the constant $k$ in (12.5.2), we observe that by (12.4.4) we have

$$\log \Gamma(2p) = \log \Gamma(p) + \log \Gamma\left( p + \tfrac{1}{2} \right) + (2p - 1) \log 2 - \tfrac{1}{2} \log \pi.$$

Substituting here for the $\Gamma$-functions their expressions obtained from (12.5.2), and allowing $p$ to tend to infinity, we find after some reductions $k = \frac{1}{2} \log 2\pi$. We have thus proved the Stirling formula:

$$\log \Gamma(p) = \left( p - \tfrac{1}{2} \right) \log p - p + \tfrac{1}{2} \log 2\pi + R(p), \tag{12.5.3}$$

where

$$R(p) = \frac{1}{12 p} - \frac{1}{360 p^3} + O\left( \frac{1}{p^5} \right).$$

From Stirling's formula, we deduce i.a. the asymptotic expressions

$$n! = \Gamma(n + 1) \sim \sqrt{2\pi}\, n^{n + \frac{1}{2}}\, e^{-n},$$

and further, when $p \to \infty$ while $h$ remains fixed,

$$\frac{\Gamma(p + h)}{\Gamma(p)} \sim p^h.$$

By differentiation, we obtain from Stirling's formula

$$\frac{\Gamma'(p)}{\Gamma(p)} = \log p - \frac{1}{2p} - \int_0^{\infty} \frac{P_1(x)}{(p + x)^2}\, dx, \qquad \frac{\Gamma''(p)}{\Gamma(p)} - \left( \frac{\Gamma'(p)}{\Gamma(p)} \right)^2 = \frac{1}{p} + \frac{1}{2 p^2} + 2 \int_0^{\infty} \frac{P_1(x)}{(p + x)^3}\, dx. \tag{12.5.4}$$

For $p = 1$, the first relation gives

$$\Gamma'(1) = -C, \tag{12.5.5}$$

where $C$ is Euler's constant defined by (12.2.7).
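The practical quality of (12.5.3) is easy to observe numerically. The Python sketch below (function names are mine) compares the standard library's `math.lgamma` with the Stirling approximation, with and without the correction terms $\frac{1}{12p} - \frac{1}{360 p^3}$ of $R(p)$:

```python
import math

def stirling(p, terms=0):
    # log Gamma(p) by (12.5.3), optionally adding the first terms of R(p)
    val = (p - 0.5) * math.log(p) - p + 0.5 * math.log(2.0 * math.pi)
    if terms >= 1:
        val += 1.0 / (12.0 * p)
    if terms >= 2:
        val -= 1.0 / (360.0 * p ** 3)
    return val

for p in (1.0, 2.0, 5.0, 10.0):
    exact = math.lgamma(p)
    # error without R(p), then with two terms of R(p)
    print(p, exact - stirling(p), exact - stirling(p, 2))

# With two terms of R(p) the error at p = 10 is already below 1e-7,
# consistent with the O(1/p^5) remainder.
assert abs(math.lgamma(10.0) - stirling(10.0, 2)) < 1e-7
```

Even at $p = 1$ the raw formula errs by only about $0.08$, and each correction term of $R(p)$ gains roughly two orders of magnitude for moderate $p$, which is why the expansion is so useful in practice despite being only asymptotic.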
— Differentiating the equation $\Gamma(p + 1) = p\, \Gamma(p)$, we further obtain

$$\frac{\Gamma'(p + 1)}{\Gamma(p + 1)} = \frac{1}{p} + \frac{\Gamma'(p)}{\Gamma(p)},$$

and hence for integral values of $p$

$$\frac{\Gamma'(n + 1)}{\Gamma(n + 1)} = -C + \sum_{\nu=1}^{n} \frac{1}{\nu}.$$

An application of the Euler-MacLaurin formula (12.2.3) gives

$$\sum_{\nu=1}^{\infty} \frac{1}{\nu^2} = \frac{\pi^2}{6}.$$

Taking $p = n$ in the second relation (12.5.4), we thus obtain (cf. p. 128)

$$\frac{\Gamma''(n)}{\Gamma(n)} - \left( \frac{\Gamma'(n)}{\Gamma(n)} \right)^2 = \frac{\pi^2}{6} - \sum_{\nu=1}^{n-1} \frac{1}{\nu^2}. \tag{12.5.7}$$

12.6. Orthogonal polynomials. — Let $F(x)$ be a distribution function with finite moments (cf. 7.4)

$$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu\, dF(x)$$

of all orders. We shall say that $x_0$ is a point of increase for $F(x)$, if $F(x_0 + h) > F(x_0 - h)$ for every $h > 0$. Suppose first that the set of all points of increase of $F$ is infinite. We shall then show that there exists a sequence of polynomials $p_0(x), p_1(x), \ldots$, uniquely determined by the following conditions:

a) $p_n(x)$ is of degree $n$, and the coefficient of $x^n$ in $p_n(x)$ is positive.

b) The $p_n(x)$ satisfy the orthogonality conditions

$$\int_{-\infty}^{\infty} p_m(x)\, p_n(x)\, dF(x) = \begin{cases} 1 & \text{for } m = n, \\ 0 & \text{for } m \ne n. \end{cases}$$

The $p_n(x)$ will be called the orthogonal polynomials associated with the distribution corresponding to $F(x)$.

We first observe that for any $n \ge 0$ the quadratic form in the $n + 1$ variables $u_0, u_1, \ldots, u_n$

$$\int_{-\infty}^{\infty} (u_0 + u_1 x + \cdots + u_n x^n)^2\, dF(x) = \sum_{i,k=0}^{n} \alpha_{i+k}\, u_i u_k$$

is definite positive. For by hypothesis $F(x)$ has at least $n + 1$ points of increase, and at least one of these must be different from all the $n$ zeros of $u_0 + \cdots + u_n x^n$, so that the integral is always positive as long as the $u_i$ are not all equal to zero. It follows (cf. 11.10) that the determinant of the form is positive:

$$D_n = \begin{vmatrix} \alpha_0 & \alpha_1 & \cdots & \alpha_n \\ \alpha_1 & \alpha_2 & \cdots & \alpha_{n+1} \\ \vdots & \vdots & & \vdots \\ \alpha_n & \alpha_{n+1} & \cdots & \alpha_{2n} \end{vmatrix} > 0.$$
Carrying out the integrations, we thus have n linear and homogeneous equations between the w + 1 unknowns Uq, . . Mn, and it follows that any polynomial Pn[x) satisfying our conditions must necessarily be of the form ( 12 . 0 . 1 ) Pn(x)--=K «() • . an (Xn-l CCn . . 0f2n~l |i X . . . X” where if is a constant. For K this polynomial is of precise degree n, as the coefficient of x'^ in the determinant is I)n-i > 0. Thus pn[x) is uniquely determined by the conditions that J pi (IF-- 1 and that the coefficient of a?” should be positive.^) We have thus established the existence of a uniquely determined sequence of ortho- gonal polynomials corresponding to any distributicm with an infinite number of points of increase. If F[x) has only N points of increase, it easily follows from the above proof that the pn[x) exist and are uniquely determined for w = 0, 1, . . ., N — 1. The determinants Dn are in this case still posi- tive for M = 0, 1, . . ., W — 1, but for N we have Dn = 0. Consider in particular the case of a distribution with a continuous frequency function f[x) = F’ {x), and let Pq[x), ... be the corresponding orthogonal polynomials. If g{x) is another frequency function, we may try to develop g{x) in a series (12,6.2) g (x) = hoPo{x)f[x) + biPi(x)f(x) + ■■■ ’) It can be shown that jr = (J?n-i Cf e. g. Szegh, Kef. 86. 132 12.6 Multiply with Pn(x) and suppose that we may integrate term by term. The orthogonality relations then give 00 (12.6.3) 6n = f pn{x)g(x)dx. — 00 Thus in particular = 1 . Expansions of this ty pe may sometimes render good service for the analytic representation of distributions. — We shall now give some examples of orthogonal polynomials. 1. The Hermite polynomials Hn(x) are defined by the relations (12.6.4) (” = 0, 1, 2, . . .). 
Hn[x) is a polynomial of degree », and we have Hq (x)= 1 , (x) = X, Hi {x) = x^ — 1, (12.6.5) i/g (x) = x^ — 3x, [x] = x^ — 6 x^ + 3, H,(x) = a;® - + Ibx, H^{x) = x^ - Ibx^ + 45a;* - 15, By repeated partial integration, we obtain the relation 00 00 (12.6.6) f HUx) Hn[x) d<D{x)=J^ f ff„(x) If„ix) e"Tda;=i”' J V27tJ to form?^w, which shows that is the sequence of orthogonal poly- nomials associated with the normal distribution defined by (10.5.3). We also note the expansions (12.6.7) and ( 12 . 6 . 8 ) v\ ^ HAx)HAy) Zi f »= -T=i_-z=e Vi — t* i*a:*+(*v*— 2f 7p (|il< 1). The first of these follows simply from the definition (12.6.4). A proof of (12.6.8) given by Cramer will be found in Charlier, Ref. 9 a, p. 50 — 53. 2. The Lagnerre polynomials are defined by the relations 133 12.6 which give Lyix) = x — l. ^ , x*-2a+ l)x + ACi + l) it(x)= 5 , By repeated partial integration we find at ( X ) 4 " .((■•: « 0 for m ^ n. BO that f ] rc+r-)j - the sequence of orthogonal polynomials associated with the distribution defined by the frequency function f(x; a, X) considered in (12.3.3), when we take ec = 1. 3. Consider the distribution obtained by placing the mass ^ in each of the N points Xi, Xf, . . .yX^. The corresponding distribution function is a step-function with a step of height ^ in each Xi. Let associated orthogonal polynomials, which according to the above are uniquely determined. The orthogonality relations then reduce to N N 2 Pm (a:,) ?»(*£) i=l ^ I for m — w, ( 0 for m 7^ n. These polynomials may be used with advantage e. g. in the following problem. Sup> pose that we have N observed points (a* 2 , y^)y . . y^\ and want to find the para- bola y — q(x) of degree n < Ny which gives the closest Jit to the observed ordinates, in the sense of the principle of least squares, i. e. such that JV l'=I becomes a minimum. We then write q{x) in the form q{x) = CoPq{x) -+■•••+ Cnpn{x)y and the ordinary rules for finding a minimum now immediately give JV »=1 for r = 0, 1 , . . 
$n$, while the corresponding minimum value of $U$ is

$$U_{\min} = \sum_{i=1}^{N} y_i^2 - N \left( c_0^2 + c_1^2 + \cdots + c_n^2 \right).$$

The case when the points $x_i$ are equidistant is particularly important in the applications. In that case, the numerical calculation of $q(x)$ and $U_{\min}$ may be performed with a comparatively small amount of labour. Cf. e.g. Esscher (Ref. 82) and Aitken (Ref. 60). — Cf. further the theory of parabolic regression in 21.6.

SECOND PART

RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS

Chapters 13—14. Foundations.

CHAPTER 13.

Statistics and Probability.

13.1. Random experiments. — In the most varied fields of practical and scientific activity, cases occur where certain experiments or observations may be repeated a large number of times under similar circumstances. On each occasion, our attention is then directed to a result of the observation, which is expressed by a certain number of characteristic features. In many cases these characteristics directly take a quantitative form: at each observation something is counted or measured. In other cases, the characteristics are qualitative: we observe e.g. the colour of a certain object, the occurrence or non-occurrence of some specified event in connection with each experiment, etc. In the latter case, it is always possible to express the characteristics in numerical form according to some conventional system of notation. Whenever it is found convenient, we may thus always suppose that the result of each observation is expressed by a certain number of quantities.

1. If we make a series of throws with an ordinary die, each throw yields as its result one of the numbers 1, 2, …, 6.

2. If we measure the length and the weight of the body of each member of a group of animals belonging to the same species, every individual gives rise to an observation, the result of which is expressed by two numbers.

3.
If, in a steel factory, we take a sample from every day's production, and measure its hardness, tensile strength and percentage of carbon, sulphur and phosphorus, the result of each observation is given by five numbers.

4. If we observe at regular time intervals the prices of $k$ different commodities, the result of each observation is expressed by $k$ numbers.

5. If we observe the sex of every child born in a certain district, the result of each observation is not directly expressed by numbers. We may, however, agree to denote the birth of a boy by 1, and the birth of a girl by 0, and thus conventionally express our results in numerical form.

In some cases we know the phenomenon under investigation sufficiently well to feel justified in making exact predictions with respect to the result of each individual observation. Thus if our experiments consist in observing, for every year, the number of eclipses of the sun visible from a given observatory, we do not hesitate to predict, on the strength of astronomical calculations, the exact value of this number. A similar situation arises in every case where it is assumed that the laws governing the phenomena are known, and these laws are sufficiently simple to be used for calculations in practice.

In the majority of cases, however, our knowledge is not precise enough to allow of exact predictions of the results of individual observations. This is the situation, e.g., in all the examples 1—5 quoted above. Even if the utmost care is taken to keep all relevant circumstances under control, the result may in such cases vary from one observation to another in an irregular way that eludes all our attempts at prediction. In such a case, we shall say that we are concerned with a sequence of random experiments.

Any systematic record of the results of sequences of this kind will be said to constitute a set of statistical data relative to the phenomenon concerned.
The chief object of statistical theory is to investigate the possibility of drawing valid inferences from statistical data, and to work out methods by which such inferences may be obtained. As a preliminary to the discussion of these questions, we shall in the two following paragraphs consider some general properties of random experiments.

13.2. Examples. — It does not seem possible to give a precise definition of what is meant by the word »random». The sense of the word is best conveyed by some examples.

If an ordinary coin is rapidly spun several times, and if we take care to keep the conditions of the experiment as uniform as possible in all respects, we shall find that we are unable to predict whether, in a particular instance, the coin will fall »heads» or »tails». If the first throw has resulted in heads and if, in the following throw, we try to give the coin exactly the same initial state of motion, it will still appear that it is not possible to secure another case of heads. Even if we try to build a machine throwing the coin with perfect regularity, it is not likely that we shall succeed in predicting the results of individual throws. On the contrary, the result of the experiment will always fluctuate in an uncontrollable way from one instance to another.
In practice, the initial state will never be exactly known, but only to a certain approximation. Similarly, when we try to establish a perfect uniformity of initial states during the course of a sequence of throws, we shall never be able to exclude small variations, the magnitude of which depends on the precision of the mechanism used for making the throws. Between the limits determined by the closeness of the approximation, there will always be room for various initial states, leading to both the possible final results of heads and tails, and thus an exact pre- diction will always be practically impossible. — Similar remarks apply to the throws with a die quoted as Ex. 1 in the preceding paragraph, and generally to all ordinary games of chance with dice and cards. According to modern biological theory, the phenomenon of heredity shows in important respects a striking analogy with a game of chance. The combinations of genes arising in the process of fertilization seem to be regulated by a mechanism more or less resembling the throwing of a coin. In a similar way as in the case of the coin, extremely small variations in the initial position and motion of the gametes may produce great differences in the properties of the offspring. Accordingly we find here, e. g. with respect to the sex of the offspring {Ex. 5 of the preceding paragraph), the same impossibility of indivi- dual prediction and the same » random fluctuations » of the results as in the case of the coin or the die. Next, let us imagine that we observe a number of men of a given age during a period of, say, one year, and note in each case whether the man is alive at the end of the year or not. Let us suppose that, with the aid of a medical expert, we have been able to collect detailed information concerning health, occupation, habits etc. of each ob- served person. 
Nevertheless, it will obviously be impossible to make exact predictions with regard to the life or death of one particular 139 13.2 person, since the causes leading: to the ultimate result are far too numerous and too complicated to allow of any precise calculation. Even for an observer endowed with a much more advanced biological knowledge than is possible at the present epoch, the practical con- clusion would be the same, owing to the multitude and complexity of the causes at work. In the examples 2 and 4 of the preceding paragraph, the situation seems to be largely analogous to the example just discussed. The laws governing the phenomena are in neither case very well known, and even if they were known to a much greater extent than at present, the structure of each case is so complicated that an individual predic- tion would still seem practically impossible. Accordingly, the observa- tions show in these cases, and in numerous other cases of a similar nature, the same kind of random irregularity as in the previous examples. It is important to note that a similar situation may arise even in cases where we consider the laws of the phenomena as perfectly known, provided that these laws are sufficiently complicated. Consider e. g. the case of the eclipses of the sun mentioned in the preceding para- graph. We do assume that it is possible to predict the annual num- ber of eclipses, and if the requisite tables are available, anybody can undertake to make such predictions. Without the tables, however, it would be rather a formidable task to work out the necessary calcu- lations, and if these difficulties should be considered insurmountable, prediction would still be practically impossible, and the fluctuations in the annual number of eclipses would seem comparable to the fluctuations in a sequence of games of chance. 
Suppose, finally, that our observations consist in making a series of repeated measurements of some physical constant, the method of measurement and the relevant external conditions being kept as uniform as possible during the whole series. It is well known that, in spite of all precautions taken by the observer, the successive measurements will generally yield different results. This phenomenon is commonly ascribed to the action of a large number of small disturbing factors, which combine their effects to a certain total »error» affecting each particular measurement. The amount of this error fluctuates from one observation to another in an irregular way that makes it impossible to predict the result of an individual measurement. — Similar considerations apply to cases of fluctuations of quality in manufactured articles, such as Ex. 3 of the preceding paragraph. Small and
The essential thing is that, in all cases where one or more of these circumstances are present, an exact prediction of the results of individual experiments becomes impossible, and the irregular fluctuations characteristic of random experiments will appear. We shall now see that, in cases of this character, there appears amidst all irregularity of fluctuations a certain typical form of regularity, that will serve as the basis of the mathematical theory of statistics.

13.3. Statistical regularity. — We have seen that, in a sequence of random experiments, it is not possible to predict individual results. These are subject to irregular random fluctuations which cannot be submitted to exact calculation. However, as soon as we turn our attention from the individual experiments to the whole sequence of experiments, the situation changes completely, and an extremely important phenomenon appears: In spite of the irregular behaviour of individual results, the average results of long sequences of random experiments show a striking regularity.

In order to explain this important mode of regularity, we consider a determined random experiment 𝔈, that may be repeated a large number of times under uniform conditions. Let 𝔖 denote the set of all a priori possible different results of an individual experiment, while S denotes a fixed subset of 𝔖. If, in a particular experiment, we obtain a result ξ belonging to the subset S, we shall say that the event defined by the relation ξ ∈ S, or briefly the event ξ ∈ S, has occurred.*) We shall often also denote an event by a single letter E, writing E = E(ξ ∈ S), and we may then speak without distinction of »the event E» or »the event ξ ∈ S».

When our experiment 𝔈 consists in throwing a die, the set 𝔖 contains the six numbers 1, 2, …, 6. Let S denote e. g. the subset containing the three numbers 2, 4, 6. The event ξ ∈ S then occurs at any throw resulting in an even number of points.
When we are concerned with measurements of some physical constant x, the value of which is a priori completely unknown, it may be at least theoretically possible for a measurement to yield as its result any real number, and accordingly the set 𝔖 would then be the one-dimensional space R₁. Let S denote e. g. the closed interval (a, b). The event ξ ∈ S then occurs every time a measurement yields a value ξ belonging to (a, b).

Let us now repeat our experiment 𝔈 a large number of times, and observe each time whether the event E = E(ξ ∈ S) takes place or not. If we find that, among the n first experiments, the event E has occurred exactly ν times, the ratio ν/n will be called the frequency ratio or simply the frequency of the event E in the sequence formed by the n first experiments. Now, if we observe the frequency ν/n of a fixed event E for increasing values of n, we shall generally find that it shows a marked tendency to become more or less constant for large values of n. This phenomenon is illustrated by Fig. 3, which shows the variation of the frequency ν/n of the event »heads» within a sequence of throws with a coin. As shown by the figure, the frequency ratio fluctuates violently for small values of n, but gradually the amplitude of the fluctuations becomes smaller, and the graph may suggest the impression that, if the series of experiments could be infinitely continued under uniform conditions, the frequency would approach some definite ideal or limiting value very near to ½.

It is an old experience that this stability of frequency ratios usually appears in long series of repeated random observations, performed under uniform conditions. For an event of the type ξ ∈ S observed in connection with such a series, we shall thus as a rule obtain a graph of the same general character as in the particular case illustrated

*) We assume here that S is some set of simple structure, so that it may be directly observed whether ξ belongs to S or not.
In the following chapter, the question will be considered from a more general point of view.

Fig. 3. Frequency ratio of »heads» in a sequence of throws with a coin. Logarithmic scale for the abscissa.

by Fig. 3. Moreover, in a case where this statement is not true, a careful examination will usually disclose some definite lack of uniformity in the conditions of the experiments.

We might thus be tempted to advance a conjecture that, generally, a frequency of the type here considered would approach a definite ideal value, if the corresponding series of experiments could be infinitely continued. A conjecture of this kind can, of course, neither be proved nor disproved by actual experience, since we can never perform an infinite sequence of experiments. The experiments do, however, strongly support the less precise conjecture that, to any event E connected with a random experiment 𝔈, we should be able to ascribe a number P such that, in a long series of repetitions of 𝔈, the frequency of E would be approximately equal to P. This is the typical form of statistical regularity which constitutes the empirical basis of statistical theory.

We must now attempt to give a precise meaning to the somewhat vague expressions used in the above statement, and we shall further have to investigate the laws that govern this mode of regularity, and to show how these laws may be applied in drawing inferences from statistical data. In order to carry out this task, we shall in the first place try to work out a mathematical theory of phenomena showing statistical regularity. Before attempting to do this it will, however, be convenient to give in the following paragraph some general remarks concerning the nature and object of any mathematical theory of a group of empirically observed phenomena.
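The stabilization of frequency ratios described here is easy to reproduce by simulation. The following sketch (a pseudo-random generator stands in for the physical coin; the function name and seed value are our own choices, not part of the text) records the frequency ν/n of »heads» after each of n throws:

```python
import random

def running_frequencies(n_throws, seed=1):
    """Simulate n_throws coin tosses and return the frequency ratio
    v/n of the event 'heads' after each successive throw."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for n in range(1, n_throws + 1):
        heads += rng.randint(0, 1)   # 1 counts as 'heads'
        freqs.append(heads / n)
    return freqs

freqs = running_frequencies(10_000)
# Early frequencies fluctuate violently; for large n they settle
# near 1/2, as in the graph of Fig. 3.
print(freqs[9], freqs[99], freqs[9999])
```

Plotting `freqs` against a logarithmic abscissa reproduces the general character of Fig. 3.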
Historically, this remarkable behaviour of frequency ratios was first observed in the field of games of chance, of which our example with the coin forms a particularly simple case. Already at an early epoch, it was observed that, in all current games with cards, dice etc., the frequency of a given result of a certain game seemed to cluster in the neighbourhood of some definite value, when the game was repeated a large number of times. The attempts to give a mathematical explanation of certain observed facts of this kind became the immediate cause of the origin (about 1650) and first development of the Mathematical Theory of Probability, under the hands of Pascal, Fermat, Huygens and James Bernoulli. A little later, the same type of regularity was found to occur in frequencies connected with various demographic data, and the theory of population statistics was based on this fact. Gradually, the field of application of statistical methods widened, and at the present time we may regard it as an established empirical fact that the »long run stability» of frequency ratios is a general characteristic of random experiments, performed under uniform conditions.

In some cases, especially when we are concerned with observations on individuals from human or other biological populations, this statistical regularity is often interpreted by considering the observed units as samples from some very large or even infinite parent population. Consider first the case of a finite population, consisting of N individuals. For any individual that comes under observation we note a certain characteristic ξ, and we denote by E some specified event of the type ξ ∈ S. The frequency of E in a sample of n observed individuals tends, as the size of the sample increases, towards the frequency of E in the total population, and actually reaches this value when we take n = N, which means that we observe every individual in the whole population.
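The finite-population case admits a simple computational sketch. Below, a hypothetical population of N = 1000 characteristics and the event »ξ is even» are our own illustrative choices; the point is only that when n = N the sample frequency coincides exactly with the population frequency:

```python
import random

def sample_frequency(population, event, n, seed=2):
    """Frequency of the event in a sample of n individuals drawn
    without replacement from a finite population."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return sum(event(x) for x in sample) / n

# Hypothetical population of N = 1000 characteristics xi in {0,...,9};
# E is the event that xi belongs to S = {0, 2, 4, 6, 8}.
N = 1000
population = [i % 10 for i in range(N)]
even = lambda x: x % 2 == 0

# For n < N the sample frequency fluctuates; when n = N the sample
# is the whole population, so it equals the population frequency
# exactly (here 1/2).
print(sample_frequency(population, even, N))  # 0.5
```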
The idea of an infinite parent population is a mathematical abstraction of the same kind as the idea that a given random experiment might be repeated an infinite number of times. We may consider this as a limiting case of a finite population, when the number N of individuals increases indefinitely. The frequency of the event E in a sample of n individuals from an infinite population will always be subject to random fluctuations, as long as n is finite, but it may seem natural to assume that, for indefinitely increasing values of n, this frequency would ultimately reach a »true» value, corresponding to the frequency of E in the total infinite population. This mode of interpretation by means of the idea of sampling may even be extended to any type of random experiment. We may, in fact, conceive of any finite sequence of repetitions of a random experiment as a sample from the hypothetical infinite population of all experiments that might have been performed under the given conditions. — We shall return to this matter in Ch. 26, where the idea of sampling will be further discussed.

13.4. Object of a mathematical theory. — When, in some group of observable phenomena, we find evidence of a confirmed regularity, we may try to form a mathematical theory of the subject. Such a theory may be regarded as a mathematical model of the body of empirical facts which constitute our data. We then choose as our starting point some of the most essential and most elementary features of the regularity observed in the data. These we express, in a simplified and idealized form, as mathematical propositions which are laid down as the basic axioms of our theory. From the axioms, various propositions are then obtained by purely logical deduction, without any further appeal to experience. The logically consistent system of propositions built up in this way on an axiomatic basis constitutes our mathematical theory.
Two classical examples of this procedure are provided by Geometry and Theoretical Mechanics. Geometry, e. g., is a system of purely mathematical propositions, designed to form a mathematical model of a large group of empirical facts connected with the position and configuration in space of various bodies. It rests on a comparatively small number of axioms, which are introduced without proof. Once the axioms have been chosen, the whole system of geometrical propositions is obtained from them by purely logical deductions. In the choice of the axioms we are guided by the regularities found in available empirical facts. The axioms may, however, be chosen in different ways, and accordingly there are several different systems of geometry: Euclidean, Lobatschewskian etc. Each of these is a logically consistent system of mathematical propositions, founded on its own set of axioms. — In a similar way, theoretical mechanics is a system of mathematical propositions, designed to form a mathematical model of observed facts connected with the equilibrium and motion of bodies.

Every proposition of such a system is true, in the mathematical sense of the word, as soon as it is correctly deduced from the axioms. On the other hand, it is important to emphasize that no proposition of any mathematical theory proves anything about the events that will, in fact, happen. The points, lines, planes etc. considered in pure geometry are not the perceptual things that we know from immediate experience. The pure theory belongs entirely to the conceptual sphere, and deals with abstract objects entirely defined by their properties, as expressed by the axioms. For these objects, the propositions of the theory are exactly and rigorously true. But no proposition about such conceptual objects will ever involve a logical proof of properties of the perceptual things of our experience.
Mathematical arguments are fundamentally incapable of proving physical facts. Thus the Euclidean proposition that the sum of the angles in a triangle is equal to π is rigorously true for a conceptual triangle as defined in pure geometry. But it does not follow that the sum of the angles measured in a concrete triangle will necessarily be equal to π, just as it does not follow from the theorems of classical mechanics that the sun and the planets will necessarily move in conformity with the Newtonian law of gravitation. These are questions that can only be decided by direct observation of the facts.

Certain propositions of a mathematical theory may, however, be tested by experience. Thus the Euclidean proposition concerning the sum of the angles in a triangle may be directly compared with actual measurements on concrete triangles. If, in systematic tests of this character, we find that the verifiable consequences of a theory really conform with sufficient accuracy to available empirical facts, we may feel more or less justified in thinking that there is some kind of resemblance between the mathematical theory and the structure of the perceptual world. We further expect that the agreement between theory and experience will continue to hold also for future events and for consequences of the theory not yet submitted to direct verification, and we allow our actions to be guided by this expectation.

Such is the case, e. g., with respect to Euclidean geometry. Whenever a proposition belonging to this theory has been compared with empirical observations, it has been found that the agreement is sufficient for all ordinary practical purposes. (It is necessary to exclude here certain applications connected with the recent development of physics.) Thus, although it can never be logically proved that the sum of the angles in a concrete triangle must be equal to π, we regard it as practically certain — i. e.
sufficiently certain to act upon in practice — that our measurements will yield a sum approximately equal to this value. Moreover, we believe that the same kind of agreement will be found with respect to any proposition deduced from Euclidean axioms, that we may have occasion to test by experience.

Naturally, our relying on the future agreement between theory and experience will grow more confident in the same measure as the accumulated evidence of such agreement increases. The »practical certainty» felt with respect to a proposition of Euclidean geometry will be different from that connected with, say, the second law of thermodynamics. Further, the closeness of the agreement that we may reasonably expect will not always be the same. Whereas in some cases the most sensitive instruments have failed to discover the slightest disagreement, there are other cases where a scientific »law» only accounts for the main features of the observed facts, the deviations being interpreted as »errors» or »disturbances».

In a case where we have found evidence of a more or less accurate and permanent agreement between theory and facts, the mathematical theory acquires a practical value, quite apart from its purely mathematical interest. The theory may then be used for various purposes. The majority of ordinary applications of a mathematical theory may be roughly classified under the three headings: Description, Analysis and Prediction.

In the first place, the theory may be used for purely descriptive purposes. A large set of empirical data may, with the aid of the theory, be reduced to a relatively small number of characteristics which represent, in a condensed form, the relevant information supplied by the data. Thus the complicated set of astronomical observations concerning the movements of the planets is summarized in a condensed form by the Copernican system.
Further, the results of a theory may be applied as tools for a scientific analysis of the phenomena under observation. Almost every scientific investigation makes use of applications belonging to this class. The general principle behind such applications may be thus expressed: Any theory which does not fit the facts must be modified. Suppose, e. g., that we are trying to find out whether the variation of a certain factor has any influence on some phenomena in which we are interested. We may then try to work out a theory, according to which no such influence takes place, and compare the consequences of this theory with our observations. If on some point we find a manifest disagreement, this indicates that we should proceed to amend our theory in order to allow for the neglected influence.

Finally, we may use the theory in order to predict the events that will happen under given circumstances. Thus, with the aid of geometrical and mechanical theory, an astronomer is able to predict the date of an eclipse. This constitutes a direct application of the principle mentioned above, that the agreement between theory and facts is expected to hold true also for future events. The same principle is applied when we use our theoretical knowledge with a view to produce some determined event, as e. g. when a ballistic expert shows how to direct a gun in order to hit the target.

13.5. Mathematical probability. — We now proceed to work out a theory designed to serve as a mathematical model of phenomena showing statistical regularity. We want a theory which takes account of the fundamental facts characteristic of this mode of regularity, and which may be put to use in the various ways indicated in the preceding paragraph. In laying the foundations of this theory, we shall try to imitate as strictly as possible the classical construction process described in the preceding paragraph. In the case of geometry, e.
g., we know that by certain actions, such as the appropriate use of a ruler and a piece of chalk, we may produce things known in everyday language as points, straight lines etc. The empirical study of the properties of these things gives evidence of certain regularities. We then postulate the existence of conceptual counterparts of the things: the points, straight lines etc. of pure geometry. Further, the fundamental features of the observed regularities are stated, in an idealized form, as the geometrical axioms.

Similarly, in the case actually before us, we know that by certain actions, viz. the performance of sequences of certain experiments, we may produce sets of observed numbers known as frequency ratios. The empirical study of the behaviour of frequency ratios gives evidence of a certain typical form of regularity, as described in 13.3. Consider an event E connected with the random experiment 𝔈. According to 13.3, the frequency of E in a sequence of n repetitions of 𝔈 shows a tendency to become constant as n increases, and we have been led to express the conjecture that for large n the frequency ratio would with practical certainty be approximately equal to some assignable number P. In our mathematical theory, we shall accordingly introduce a definite number P, which will be called the probability of the event E with respect to the random experiment 𝔈. Whenever we say that the probability of an event E with respect to an experiment 𝔈 is equal to P, the concrete meaning of this assertion will thus simply be the following: In a long series of repetitions of 𝔈, it is practically certain that the frequency of E will be approximately equal to P.¹) — This statement will be referred to as the frequency interpretation of the probability P.

The probability number P introduced in this way provides a conceptual counterpart of the empirical frequency ratios.
It will be observed that, in order to define the probability P, both the type of random experiment 𝔈 and the event E must be specified. Usually we shall, however, regard the experiment 𝔈 as fixed, and we may then without ambiguity simply talk of the probability of the event E.

For the further development of the theory, we shall have to consider the fundamental properties of frequency ratios and express these, in an idealized form, as statements concerning the properties of the corresponding probability numbers. These statements, together with the existence postulate for the probability numbers, will serve as the axioms of our theory. — In the present paragraph, we shall only add a few preliminary remarks; the formal statement of the axioms will then be given in the following chapter.

For any frequency ratio ν/n we obviously have 0 ≤ ν/n ≤ 1. Since, by definition, any probability P is approximately equal to some frequency ratio, it will be natural to assume that P satisfies the corresponding inequality 0 ≤ P ≤ 1, and this will in fact be one of the properties expressed by our axioms.

If E is an impossible event, i. e. an event that can never occur at a performance of the experiment 𝔈, any frequency of E must be zero, and consequently we take P = 0. — On the other hand, if we know that for some event E we have P = 0, then E is not necessarily an impossible event. In fact, the frequency interpretation of P only implies that the frequency ν/n of E will for large n be approximately equal to zero, so that in the long run E will at most occur in a very small percentage of all cases. The same conclusion holds not only when P = 0, but even under the more general assumption that 0 ≤ P < ε, where ε is some very small number.
If E is an event of this type, and if the experiment 𝔈 is performed one single time, it can thus be considered as practically certain that E will not occur. — This particular case of the frequency interpretation of a probability will often be applied in the sequel.

Similarly, if E is a certain event, i. e. an event that always occurs at a performance of 𝔈, we take P = 1. — On the other hand, if we know that P = 1, we cannot infer that E is certain, but only that in the long run E will occur in all but a very small percentage of cases. The same conclusion holds under the more general assumption that 1 − ε < P ≤ 1, where ε is some very small number. If E is an event of this type, and if the experiment 𝔈 is performed one single time, it can be considered as practically certain that E will occur.

With respect to the foundations of the theory of probability, many different opinions are represented in the literature. None of these has so far met with universal acceptance. We shall conclude this paragraph by a very brief survey of some of the principal standpoints.

The theory of probability originated from the study of problems connected with ordinary games of chance (cf 13.3). In all these games, the results that are a priori possible may be arranged in a finite number of cases supposed to be perfectly symmetrical, such as the cases represented by the six sides of a die, the 52 cards in an ordinary pack of cards, etc. This fact seemed to provide a basis for a rational explanation of the observed stability of frequency ratios, and the 18th century mathematicians were thus led to the introduction of the famous principle of equally possible cases which, after having been more or less tacitly assumed by earlier writers, was explicitly framed by Laplace in his classical work (Ref. 22) as the fundamental principle of the whole theory.

¹) At a later stage (cf 10.8), we shall be able to give a more precise form to this statement.
According to this principle, a division in »equally possible» cases is conceivable in any kind of observations, and the probability of an event is the ratio between the number of cases favourable to the event, and the total number of possible cases. The weakness of this definition is obvious. In the first place, it does not tell us how to decide whether two cases should be regarded as equally possible or not. Moreover, it seems difficult, and to some minds even impossible, to form a precise idea as to how a division in equally possible cases could be made with respect to observations not belonging to the domain of games of chance. Much work has been devoted to attempts to overcome these difficulties and introduce an improved form of the classical definition.

On the other hand, many authors have tried to replace the classical definition by something radically different. Modern work on this line has been largely influenced by the general tendency to build any mathematical theory on an axiomatic basis. Thus some authors try to introduce a system of axioms directly based on the properties of frequency ratios. The chief exponent of this school is von Mises (Ref. 27, 28, 169), who defines the probability of an event as the limit of the frequency ν/n of that event, as n tends to infinity. The existence of this limit, in a strictly mathematical sense, is postulated as the first axiom of the theory. Though undoubtedly a definition of this type seems at first sight very attractive, it involves certain mathematical difficulties which deprive it of a good deal of its apparent simplicity. Besides, the probability definition thus proposed would involve a mixture of empirical and theoretical elements, which is usually avoided in modern axiomatic theories. It would, e. g., be comparable to defining a geometrical point as the limit of a chalk spot of infinitely decreasing dimensions, which is usually not done in modern axiomatic geometry.
A further school chooses the same observational starting-point as the frequency school, but avoids postulating the existence of definite limits of frequency ratios, and introduces the probability of an event simply as a number associated with that event. The axioms of the theory, which express the rules for operating with such numbers, are idealized statements of observed properties of frequency ratios. The theory of this school has been exposed from a purely mathematical point of view by Kolmogoroff (Ref. 21). More or less similar standpoints are represented by Doob, Feller and Neyman (Ref. 76, 84, 30). A work of the present author (Ref. 11) belongs to the same order of ideas, and the present book constitutes an attempt to build the theory of statistics on the same principles.

So far, we have throughout been concerned with the theory of probability, conceived as a mathematical theory of phenomena showing statistical regularity. According to this point of view, the probabilities have their counterparts in observable frequency ratios, and any probability number assigned to a specified event must, in principle, be liable to empirical verification. The differences between the various schools mentioned above are mainly restricted to the foundations and the mathematical exposition of the subject, whereas from the point of view of the applications the various theories are largely equivalent.

In radical opposition to all the above approaches stands the more general conception of probability theory as a theory of degrees of reasonable belief, represented e. g. by Keynes (Ref. 20) and Jeffreys (Ref. 18). According to this theory in its most advanced form given by Jeffreys, any proposition has a numerically measurable probability. Thus e. g. we should be able to express in definite numerical terms the degree of »practical certainty» felt with respect to the future agreement between some mathematical theory and observed facts (cf 13.4).
Similarly there would be a definite numerical probability of the truth of any statement such as: »The 'Masque de Fer' was the brother of Louis XIV», »The present European war will end within a year», or »There is organic life on the planet of Mars». Probabilities of this type have no direct connection with random experiments, and thus no obvious frequency interpretation. In the present book, we shall not attempt to discuss the question whether such probabilities are numerically measurable and, if this question could be answered in the affirmative, whether such measurement would serve any useful purpose.

CHAPTER 14.

Fundamental Definitions and Axioms.

14.1. Random variables. (Axioms 1—2.) — Consider a determined random experiment 𝔈, which may be repeated a large number of times under uniform conditions. We shall suppose that the result of each particular experiment is given by a certain number of real quantities ξ₁, …, ξₖ, where k ≥ 1.
However, it would obviously not be convenient to restrict ourselves to the consideration of intervals. We may also want to consider the probabilities of events that correspond e. g. to sets obtained from intervals by means of the operations of addition, subtraction and multiplication (cf 1.3). We have seen in 2.3 and 3.3 that, by such operations, we are led to the class of Borel sets in Rk as a natural extension of the class of all intervals. It thus seems reasonable directly to extend our considerations to this class, and assume that P (S) is defined for any Borel set. It is true that when S is some Borel set of complicated structure, the event g c: S' may not be directly observable, and the introduction of probabilities of events of this type must be regarded as a theoretical idealization. Some of the con- sequences of the theory will, however, always be directly observable, and the practical value of the theory will have to be judged from the agreement between its observable consequences and empirical facts. — We may thus state our first axiom: Axiom 1. — To any random variable, g in Rk there corresponds a set function P [S) uniquely defined for all Borel sets S in Rk, such that P {S) represents the probability of the event [or relation) g < S. Throughout the exposition of the general theory, random variables will pre- ferably be denoted by the letters ^ and 37. We use heavy-faced types for multi-dimen- sional variables {k > 1), and ordinary types for one-dimensional variables. 152 14.1 As we have seen in 13.5, it will be natural to assume that any probability P satisfies the inequality 0 ^ P ^ 1 . Further, at any per- formance of the experiment S, the observed value of § must lie some- where in Rjt, so that the event g < R* is a certain event, and in ac- cordance with 13.6 we then take P(Rjt)= 1. 
Let now S₁ and S₂ be two sets in Rₖ without a common point.¹) Consider a sequence of n repetitions of 𝔈, and let

ν₁ denote the number of occurrences of the event ξ ∈ S₁,
ν₂ denote the number of occurrences of the event ξ ∈ S₂,
ν denote the number of occurrences of the event ξ ∈ S₁ + S₂.

We then obviously have ν = ν₁ + ν₂, and hence the corresponding frequency ratios satisfy the relation

ν/n = ν₁/n + ν₂/n.

For large values of n it is, by assumption, practically certain that the frequencies ν/n, ν₁/n and ν₂/n are approximately equal to P(S₁ + S₂), P(S₁) and P(S₂) respectively. It thus seems reasonable to require that the probability P should possess the additive property

P(S₁ + S₂) = P(S₁) + P(S₂).

The argument extends itself immediately to any finite number of sets. In order to obtain a simple and coherent mathematical theory we shall, however, now introduce a further idealization. We shall, in fact, assume that the additive property of P(S) may be extended even to an enumerable sequence of sets S₁, S₂, …, no two of which have a common point, so that we have

P(S₁ + S₂ + ⋯) = P(S₁) + P(S₂) + ⋯.

(As in the case of Axiom 1 this implies, of course, the introduction of relations that are not directly observable.) Using the terminology introduced in 6.2 and 8.2, we may now state our second axiom:

Axiom 2. — The function P(S) is a non-negative and additive set function in Rₖ such that P(Rₖ) = 1.

According to 6.6 and 8.4, any set function P(S) with the properties stated in Axiom 2 defines a distribution in Rₖ, that may be concretely interpreted by means of a distribution of a mass unit over the space Rₖ, such that any set S carries the mass P(S). This distribution will be called the probability distribution of the random variable ξ, and the set function P(S) will be called the probability function (abbreviated pr.f.) of ξ. Similarly, the point function F(x) = F(x₁, …,

¹) As already stated in 6.1, we only consider Borel sets.
the space R_k, such that any set S carries the mass P(S). This distribution will be called the probability distribution of the random variable ξ, and the set function P(S) will be called the probability function (abbreviated pr.f.) of ξ. Similarly, the point function F(x) = F(x₁, . . ., x_k) corresponding to P(S), which is defined by (6.6.1) in the case k = 1, and by (8.4.1) in the general case, will be called the distribution function (abbreviated d.f.) of ξ. As shown in 6.6 and 8.4, the distribution may be uniquely defined either by the set function P(S) or by the point function F(x).

Finally, we observe that the Axioms 1 and 2 may be summed up in the following statement: Any random variable has a unique probability distribution.

If, e.g., the experiment 𝔈 consists in making a throw with a die, and observing the number of points obtained, the corresponding random variable ξ is a number that may assume the values 1, 2, . . ., 6, and these values only. Our axioms then assert the existence of a distribution in R₁, with certain masses p₁, . . ., p₆ placed in the points 1, 2, . . ., 6, such that p_r represents the probability of the event ξ = r, while Σ p_r = 1. On the other hand, it is important to observe that it does not follow from the axioms that p_r = 1/6 for every r. The numbers p_r should, in fact, be regarded as physical constants of the particular die that we are using, and the question as to their numerical values cannot be answered by the axioms of probability theory, any more than the size and the weight of the die are determined by the geometrical and mechanical axioms. However, experience shows that in a well made die the frequency of any event ξ = r in a long series of throws usually approaches 1/6, and accordingly we shall often assume that all the p_r are equal to 1/6, when the example of the die is used for purposes of illustration. This is, however, an assumption and not a logical consequence of the axioms.

If, on the other hand, 𝔈 consists in observing the stature ξ of a man belonging to some given group, ξ may assume any value within a certain part of the scale, and our axioms now assert the existence of a non-negative and additive set function P(S) in R₁, such that P(S) represents the probability that ξ takes a value belonging to the set S.
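The die example above can be sketched as a set function on a finite space (the values p_r = 1/6 are, as stressed, an illustrative assumption and not a consequence of the axioms):

```python
from fractions import Fraction

# The die distribution postulated by the axioms: masses p_1, ..., p_6
# in the points 1, ..., 6.  The values p_r = 1/6 are an assumption
# made for illustration, not a logical consequence of the axioms.
p = {r: Fraction(1, 6) for r in range(1, 7)}
assert sum(p.values()) == 1          # the total mass is unity

def P(S):
    """Set function P(S): the mass carried by the set S."""
    return sum(p[r] for r in S if r in p)

# P is non-negative, additive over sets without a common point, and
# P(whole space) = 1 -- the properties required by Axiom 2 (on this
# finite space every subset is trivially a Borel set).
assert P({1, 2}) + P({3}) == P({1, 2, 3})
assert P(range(1, 7)) == 1
print(P({2, 4, 6}))                  # 1/2
```

Exact rational arithmetic (`Fraction`) is used so that the additivity checks hold identically rather than up to rounding.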
The Axioms 1 and 2 are, for the class of random variables here considered, equivalent to the axioms given by Kolmogoroff (Ref. 21). The axioms of Kolmogoroff are, however, applicable to random variables defined in spaces of a more general character than those here considered. The same axioms as above were used in a work of the present author (Ref. 11).

14.2. Combined variables. (Axiom 3.) — We shall first consider a particular case. Let the random experiments 𝔈 and 𝔉 be connected with the one-dimensional random variables ξ and η respectively. Thus the result of 𝔈 is represented by one single quantity ξ, while the result of 𝔉 is another quantity η. It often occurs that we have occasion to consider a combined experiment (𝔈, 𝔉) which consists in making, in accordance with some given rule, one performance of each of the experiments 𝔈 and 𝔉, and observing jointly the results of both. This means that we are observing a variable point (ξ, η), the coordinates of which are the results ξ and η of the experiments 𝔈 and 𝔉. We may then consider the point (ξ, η) as representing a two-dimensional variable, that will be called a combined variable defined by ξ and η. The space of the combined variable is the two-dimensional product space (cf 3.5) of the one-dimensional spaces of ξ and η.

Let the experiment 𝔈 consist in a throw with a certain die, while 𝔉 consists in a throw with another die, and the combined experiment (𝔈, 𝔉) consists in a throw with both dice. The result of 𝔈 is a number ξ that may assume the values 1, 2, . . ., 6, and the same holds for the result η of 𝔉. The combined variable (ξ, η) then expresses the joint results for both dice, and its possible »values» are the 36 pairs of numbers (1, 1), . . ., (6, 6).

If, on the other hand, the experiment 𝔈 consists in observing the stature ξ of a married man, while 𝔉 consists in observing the stature η of a married woman, the combined experiment (𝔈, 𝔉) may consist e.g.
in observing both statures (ξ, η) of a married couple. The point (ξ, η) may in this case assume any position within a certain part of the plane.

The principle of combination of variables may be applied to more general cases. Let the random experiments 𝔈₁, . . ., 𝔈_n be connected with the random variables ξ₁, . . ., ξ_n of k₁, . . ., k_n dimensions respectively, and consider a combined experiment (𝔈₁, . . ., 𝔈_n) which consists in making one performance of each 𝔈_r and observing jointly all the results. We then obtain a combined variable (ξ₁, . . ., ξ_n) represented by a point in the (k₁ + ··· + k_n)-dimensional product space (cf 3.5) of the spaces of all the ξ_r.

The empirical study of frequency ratios connected with combined experiments discloses a statistical regularity of the same kind as in the case of the component experiments. Any experiment composed of random experiments shows, in fact, the character of a random experiment, and we may accordingly state our third axiom:

Axiom 3. — If ξ₁, . . ., ξ_n are random variables, any combined variable (ξ₁, . . ., ξ_n) is also a random variable.

It then follows from the preceding axioms that any combined variable has a unique probability distribution in its space of k₁ + ··· + k_n dimensions. This distribution will often be called the joint or simultaneous distribution of the variables ξ₁, . . ., ξ_n.

Consider now the case of two random variables ξ and η, of k₁ and k₂ dimensions respectively. Let P₁ and P₂ denote the pr.f:s of ξ and η, while P denotes the pr.f. of the combined variable (ξ, η). If S denotes a set in the space of the variable ξ, the expression P(ξ ∈ S) represents the probability that the combined variable (ξ, η) takes a value belonging to the cylinder set (cf 3.5) defined by the relation ξ ∈ S, or in other words the probability that ξ takes a value belonging to S, irrespective of the value of η.
Similarly, if T is a set in the space of η, the expression P(η ∈ T) represents the probability that η takes a value belonging to T, irrespective of the value of ξ. We thus have

(14.2.1) P(ξ ∈ S) = P₁(S),  P(η ∈ T) = P₂(T),

and according to (8.4.2) this shows that the marginal distributions of the (k₁ + k₂)-dimensional combined distribution, relative to the subspaces of the variables ξ and η, are identical with the distributions of ξ and η respectively. — Obviously this may be generalized to any number of component variables. When the mass in the combined distribution is projected on the subspace of any of the component variables, the marginal distribution thus obtained will always be identical with the distribution of the corresponding variable.

An important case of combination of variables arises when we consider a sequence of repetitions of a random experiment 𝔈. Let us form a combined experiment by performing n times the same experiment 𝔈, and observing all the results ξ₁, . . ., ξ_n of the n repetitions. The result of this combined experiment will then be an observed value of the combined variable (ξ₁, . . ., ξ_n), which expresses the joint results of all the n repetitions of 𝔈.

If, e.g., 𝔈 consists in a throw with a die, the corresponding one-dimensional random variable ξ has the six possible values 1, 2, . . ., 6. The combined variable (ξ₁, . . ., ξ_n) then expresses the joint results of n successive throws, and its »values» are the 6ⁿ systems of n numbers (1, . . ., 1), . . ., (6, . . ., 6). According to Axiom 3, there exists a corresponding probability distribution in R_n, with determined probabilities p_{i₁, . . ., i_n} corresponding to the various possible values of the combined variable.

In problems where several random variables are considered simultaneously, we shall always assume that a rule of combination is given for all the variables that enter into the question, so that the combined variable is defined.
We shall then as a rule use the symbol P(S) to denote the pr.f. of the combined variable.

14.3. Conditional distributions. — Let ξ and η be random variables of k₁ and k₂ dimensions, attached to the random experiments 𝔈 and 𝔉. Let P denote the pr.f. of the combined variable (ξ, η), while S and T are sets in the spaces of ξ and η respectively. The expression P(ξ ∈ S, η ∈ T) then represents the probability of the event defined by the joint relations ξ ∈ S, η ∈ T, or, in other words, the probability that the combined variable (ξ, η) takes a value belonging to the rectangle set (cf 3.5) with the sides S and T.

Suppose now that P(ξ ∈ S) > 0. We then introduce a new quantity P(η ∈ T | ξ ∈ S) defined by the relation

(14.3.1) P(η ∈ T | ξ ∈ S) = P(ξ ∈ S, η ∈ T) / P(ξ ∈ S).

Similarly, supposing that P(η ∈ T) > 0, we introduce another new quantity P(ξ ∈ S | η ∈ T) by writing

(14.3.2) P(ξ ∈ S | η ∈ T) = P(ξ ∈ S, η ∈ T) / P(η ∈ T).

In order to justify the names that will presently be given to these quantities, we shall now deduce some important properties of the latter.

In the first place, let us in (14.3.2) consider T as a fixed set, while S is variable in the space R_{k₁} of the variable ξ. The second member of (14.3.2) then becomes a non-negative and additive function of the set S. When S = R_{k₁}, the rectangle set ξ ∈ R_{k₁}, η ∈ T is identical with the cylinder set (cf 3.5) η ∈ T, so that the second member of (14.3.2) then assumes the value 1. Thus P(ξ ∈ S | η ∈ T) is, for fixed T, a non-negative and additive function of the set S which for S = R_{k₁} assumes the value 1. In other words, P(ξ ∈ S | η ∈ T) is, for fixed T, the probability function of a certain distribution in R_{k₁}. In the same way it is shown that P(η ∈ T | ξ ∈ S) is, for fixed S, the pr.f. of a certain distribution in the space R_{k₂} of the variable η. — We shall now show that, in a certain generalized sense, these quantities may in fact be regarded as probabilities having a determined frequency interpretation.
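The marginal relation (14.2.1) and the definitions (14.3.1)–(14.3.2) can be sketched on a small finite joint distribution; the masses below are illustrative, chosen merely to make ξ and η dependent.

```python
from fractions import Fraction

# Illustrative joint distribution of a pair (xi, eta) of one-dimensional
# variables, each taking the values 0 and 1; the masses are chosen
# arbitrarily so that xi and eta are dependent.
joint = {
    (0, 0): Fraction(3, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}
ALL = {0, 1}

def P(S, T):
    """P(xi in S, eta in T): mass of the rectangle set with sides S, T."""
    return sum(m for (x, y), m in joint.items() if x in S and y in T)

# (14.2.1): projecting the joint mass on the xi-subspace recovers the
# marginal distribution of xi, i.e. P(xi in S) is the mass of the
# cylinder set with base S.
P1 = {x: P({x}, ALL) for x in ALL}
assert sum(P1.values()) == 1

def cond(T, S):
    """(14.3.1): P(eta in T | xi in S); defined only when P(xi in S) > 0."""
    assert P(S, ALL) > 0
    return P(S, T) / P(S, ALL)

S, T = {1}, {1}
assert P(S, T) == P(S, ALL) * cond(T, S)   # the product rule for P(S, T)
print(cond(T, S))   # 3/4, against the unconditional P(eta = 1) = 1/2
```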
Consider a sequence Z of n repetitions of the combined experiment (𝔈, 𝔉). Each of the n experiments which are the elements of Z yields as its result an observed »value» of the combined variable (ξ, η). In the sequence Z, let

ν₁ denote the number of occurrences of the event ξ ∈ S,
ν₂ denote the number of occurrences of the event η ∈ T,
ν denote the number of occurrences of the joint event ξ ∈ S and η ∈ T,

while Z₁, Z₂ and Z₁₂ are the corresponding sub-sequences of Z. — Obviously the third event occurs when and only when the first and second events both occur, so that Z₁₂ consists precisely of the elements common to Z₁ and Z₂.

According to the frequency interpretation of a probability (cf 13.5), it is practically certain that the relations

ν₁/n ≈ P(ξ ∈ S),  ν₂/n ≈ P(η ∈ T),  ν/n ≈ P(ξ ∈ S, η ∈ T)

will, for large n, be approximately satisfied. By (14.3.1) and (14.3.2) we then have, approximately,

(14.3.3) P(η ∈ T | ξ ∈ S) ≈ ν/ν₁,  P(ξ ∈ S | η ∈ T) ≈ ν/ν₂.

Consider now the ν₁ elements of the sub-sequence Z₁. These are all cases among our n repetitions where the event ξ ∈ S has occurred. Among these, there are exactly ν cases where, in addition, the event η ∈ T has occurred, viz. the ν cases forming the sub-sequence Z₁₂. Thus the ratio ν/ν₁ is the frequency of the event η ∈ T in the sub-sequence Z₁ or, as we may express it, ν/ν₁ is the conditional frequency of the event η ∈ T, relative to the hypothesis ξ ∈ S. The corresponding property of the ratio ν/ν₂ is obtained by simple permutation. — The approximate relations (14.3.3) now provide a frequency interpretation of the expressions P(η ∈ T | ξ ∈ S) and P(ξ ∈ S | η ∈ T), which will justify the introduction of the following definitions:

The quantity P(η ∈ T | ξ ∈ S) defined by (14.3.1) will be called the conditional probability of the event η ∈ T, relative to the hypothesis ξ ∈ S. Accordingly, the distribution in R_{k₂} defined by (14.3.1) for fixed S will be called the conditional distribution of η, relative to the hypothesis ξ ∈ S.
— With respect to the quantity P(ξ ∈ S | η ∈ T) defined by (14.3.2), we shall use the denominations obtained by permutation of symbols.

It should be well observed that each conditional probability is hereby defined only in the case when the probability of the corresponding hypothesis is different from zero. When P(ξ ∈ S) and P(η ∈ T) are both different from zero, we obtain from (14.3.1) and (14.3.2) the relation

(14.3.4) P(ξ ∈ S, η ∈ T) = P(ξ ∈ S) P(η ∈ T | ξ ∈ S) = P(η ∈ T) P(ξ ∈ S | η ∈ T).

In the example considered in the preceding paragraph, where ξ is the stature of a married man, and η the stature of his wife, the data corresponding to all observed values of ξ determine the distribution of ξ. Thus e.g. the probability of the relation a < ξ ≤ b will be approximately determined by the frequency of the corresponding event in the totality of our data. Suppose now that we select from our data the subgroup of all cases where η is larger than some given constant c. The data corresponding to the values of ξ in the cases belonging to this subgroup determine the conditional distribution of ξ, relative to the hypothesis η > c. Thus e.g. the frequency of the event a < ξ ≤ b within the subgroup is a conditional frequency as defined above, and for a large number of observations this becomes, with practical certainty, approximately equal to the conditional probability of the relation a < ξ ≤ b relative to the hypothesis η > c. Here the set S is the interval a < ξ ≤ b, while the set T is the interval η > c. It is evident that, in this case, we have reason to suppose that the conditional probability will differ from the probability in the totality of the data, since the taller women corresponding to the hypothesis η > c may on the average be expected to choose, or be chosen by, taller husbands than the shorter women.
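A simulation sketch of this stature example (the dependence model and every numerical constant below are invented purely for illustration): within the subgroup η > c, the conditional frequency of a < ξ ≤ b differs markedly from its frequency in the totality of the data.

```python
import random

# Simulated 'couples' (xi, eta) with a common component producing the
# positive dependence described in the text; all parameters are
# illustrative assumptions, not data.
random.seed(2)

pairs = []
for _ in range(100_000):
    common = random.gauss(0, 1)
    xi = 175 + 6 * common + random.gauss(0, 4)    # stature of the husband
    eta = 165 + 4 * common + random.gauss(0, 4)   # stature of the wife
    pairs.append((xi, eta))

a, b, c = 180.0, 200.0, 170.0
overall = sum(a < x <= b for x, _ in pairs) / len(pairs)

subgroup = [x for x, y in pairs if y > c]         # hypothesis eta > c
conditional = sum(a < x <= b for x in subgroup) / len(subgroup)

# The conditional frequency of a < xi <= b, relative to the hypothesis
# eta > c, exceeds its frequency in the totality of the data.
print(overall, conditional)
assert conditional > overall
```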
On the other hand, let ξ still stand for the stature of a married man, while η denotes the stature of the wife belonging to the couple immediately following ξ in the population register from which our data are taken. In this case, there will be no obvious reason to expect the conditional probability of the relation a < ξ ≤ b, relative to the hypothesis η > c, to be different from the unconditional probability P(a < ξ ≤ b). On the contrary, we should expect the conditional distribution of ξ to be independent of any hypothesis made with respect to η, and conversely. If this condition is satisfied, we are concerned with the case of independent variables, that will be discussed in the following paragraph.

14.4. Independent variables. — An important particular case of the concepts introduced in the preceding paragraph arises when the multiplicative relation

(14.4.1) P(ξ ∈ S, η ∈ T) = P(ξ ∈ S) P(η ∈ T)

is satisfied for any sets S and T. The relations (14.3.1) and (14.3.2) show that this implies

(14.4.2) P(ξ ∈ S | η ∈ T) = P(ξ ∈ S)  if P(η ∈ T) > 0,

(14.4.3) P(η ∈ T | ξ ∈ S) = P(η ∈ T)  if P(ξ ∈ S) > 0,

so that the conditional distribution of ξ is independent of any hypothesis made with respect to η, and conversely. In this case we shall say that ξ and η are independent random variables, and that the events ξ ∈ S and η ∈ T are independent events.

Conversely, suppose that one of the two last relations, say (14.4.2), is satisfied for all sets S and T such that the conditional probability on the left-hand side is defined, i.e. for P(η ∈ T) > 0. It then follows from (14.3.2) that the multiplicative relation (14.4.1) holds in all these cases. (14.4.1) is, however, trivial in the case P(η ∈ T) = 0, since both members are then equal to zero. Thus (14.4.1) holds for all S and T, and hence we infer (14.4.3). Thus either relation (14.4.2) or (14.4.3) constitutes a necessary and sufficient condition of independence. We shall now give another necessary and sufficient condition.
Let P₁ and P₂ denote the probability functions of ξ and η, while the distribution functions of ξ, η and (ξ, η) are

F₁(x) = F₁(x₁, . . ., x_{k₁}) = P₁(ξ₁ ≤ x₁, . . ., ξ_{k₁} ≤ x_{k₁}),
F₂(y) = F₂(y₁, . . ., y_{k₂}) = P₂(η₁ ≤ y₁, . . ., η_{k₂} ≤ y_{k₂}),
F(x, y) = F(x₁, . . ., x_{k₁}, y₁, . . ., y_{k₂}) = P(ξ₁ ≤ x₁, . . ., ξ_{k₁} ≤ x_{k₁}, η₁ ≤ y₁, . . ., η_{k₂} ≤ y_{k₂}).

According to (14.2.1), the multiplicative relation (14.4.1) may be written

(14.4.4) P(ξ ∈ S, η ∈ T) = P₁(S) P₂(T).

Now it has been shown in 8.6 that, when P₁ and P₂ are given pr.f:s in the spaces of ξ and η, there is one and only one distribution in the product space satisfying (14.4.4), viz. the distribution defined by the d.f.

(14.4.5) F(x, y) = F₁(x) F₂(y).

Thus (14.4.5) is a necessary and sufficient condition for the independence of the variables ξ and η.

Consider now the case of n random variables ξ₁, . . ., ξ_n, with pr.f:s P₁, . . ., P_n and d.f:s F₁, . . ., F_n. Let P and F denote the pr.f. and the d.f. of the combined variable (ξ₁, . . ., ξ_n). In direct generalization of the above, we shall say that ξ₁, . . ., ξ_n are independent random variables, if the multiplicative relation

(14.4.6) P(ξ₁ ∈ S₁, . . ., ξ_n ∈ S_n) = ∏_{r=1}^{n} P(ξ_r ∈ S_r) = ∏_{r=1}^{n} P_r(S_r)

is satisfied for any sets S₁, . . ., S_n. Using the final remark of 8.6, we find that the condition (14.4.5) may be directly generalized, so that in the present case the relation F = F₁ F₂ ··· F_n is a necessary and sufficient condition of independence. — If ξ_r and the combined variable (ξ₁, . . ., ξ_{r−1}) are independent for r = 2, 3, . . ., n, then ξ₁, . . ., ξ_n are independent. This follows directly from the independence definition (14.4.6).

If, in a sequence ξ₁, ξ₂, . . ., any group of n of the variables are independent, we shall briefly say that ξ₁, ξ₂, . . . form a sequence of independent variables. — An important case of a sequence of this type arises when we consider a sequence of repetitions of a random experiment 𝔈.
If the conditions of the successive experiments are strictly uniform, the probability P of any specified event connected with, say, the n:th experiment cannot be supposed to be in any way influenced by the results of the n − 1 preceding experiments. This implies, however, that the distribution of the random variable ξ_n connected with the n:th experiment is independent of any hypothesis made with respect to the value assumed by the combined variable (ξ₁, . . ., ξ_{n−1}), so that ξ_n and (ξ₁, . . ., ξ_{n−1}) are independent. According to the above, it then follows that ξ₁, ξ₂, . . . form a sequence of independent variables. A sequence of repetitions of a random experiment 𝔈 showing a uniformity of this character will be briefly denoted as a sequence of independent repetitions of 𝔈. When nothing is said to the contrary, we shall always assume that any sequence of repetitions that we may consider is of this type.

Consider a combined experiment consisting of two throws with a certain die. Let us repeat this combined experiment a large number of times, the conditions of each single throw being kept as uniform as possible. We may then study the behaviour of the conditional frequency of any given result of the second throw, relative to any hypothesis made with respect to the result of the first throw. Long experience has failed to detect any kind of influence of such hypotheses on the behaviour of the conditional frequency, and it seems reasonable to assume that the random variables connected with the two throws are independent. The same situation arises when we consider a combined experiment consisting of n throws, where n may have any value, and accordingly we assume that a sequence of throws made under uniform conditions forms a sequence of independent repetitions, in the sense stated above. Suppose now that, in each throw, all the six possible results have the probability 1/6.
Then by (14.4.6) each of the 6ⁿ possible results of n consecutive throws will have the probability (1/6)ⁿ.

Finally, let us consider n independent variables ξ₁, . . ., ξ_n. If, in the multiplicative relation (14.4.6), we allow a certain number of the sets S_r to coincide with the whole spaces of the corresponding variables, it follows that any group of n₁ < n of the variables are independent.

The converse of the last proposition is not true. We shall, in fact, give an example due to S. Bernstein of three one-dimensional variables ξ, η, ζ such that any two of the variables are independent, while the three variables ξ, η, ζ are not independent. Let the three-dimensional distribution of the combined variable (ξ, η, ζ) be such that each of the four points

(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)

carries the mass ¼. It is then easily verified that any one-dimensional marginal distribution has a mass equal to ½ in each of the two points 0 and 1, while any two-dimensional marginal distribution has a mass equal to ¼ in each of the four points (0, 0), (1, 0), (0, 1) and (1, 1). It follows that any two of the variables are independent. We have e.g.

P(ξ = 1, η = 1) = ¼,  P(ξ = 1) P(η = 1) = (½)² = ¼,

and it is seen without difficulty that the analogous relation holds for any events ξ ∈ S and η ∈ T, so that (14.4.1) is satisfied. But the three variables ξ, η, ζ are not independent, as we have

P(ξ = 1, η = 1, ζ = 1) = ¼,  but  P(ξ = 1) P(η = 1) P(ζ = 1) = (½)³ = ⅛.

14.5. Functions of random variables. — Consider first the case of a one-dimensional random variable ξ with the pr.f. P. Suppose that, at each performance of the random experiment to which ξ is attached, we do not observe directly the variable ξ itself, but a certain real-valued function g(ξ), which is finite and uniquely defined for all real ξ. As usual we assume that g(ξ) is B-measurable (cf 5.2). The equation η = g(ξ) defines a correspondence between the variables ξ and η.
Denote by Y a given set on the η-axis, and by X the corresponding set of all ξ such that η = g(ξ) ∈ Y. It has been shown in 5.2 that the set X corresponding to any Borel set Y is a Borel set. When X and Y are corresponding sets, we have η ∈ Y when and only when ξ ∈ X, so that the two events η ∈ Y and ξ ∈ X are completely equivalent. The latter event has, by Axiom 1, a definite probability P(X), and thus the event η ∈ Y has the same probability.

We thus see that any function η = g(ξ) of the random variable ξ is itself a random variable, with a probability distribution determined by the distribution of ξ. In fact, if Q denotes the pr.f. of η, it follows from the above that we have for any Borel set Y

(14.5.1) Q(Y) = P(X),

where X is the set corresponding to Y. If, in particular, we choose for the set Y the closed interval (−∞, y), and denote by S_y the set of all ξ such that η = g(ξ) ≤ y, it follows that the d.f. of the variable η is

(14.5.2) G(y) = Q(η ≤ y) = P(S_y).

Let the ξ-distribution be interpreted in the usual way as a distribution of mass on the ξ-axis. Let us imagine that every mass particle in this distribution is moved from its original place on the ξ-axis, first in a vertical direction until it reaches the curve η = g(ξ), and then horizontally towards the η-axis. The distribution on the η-axis generated in this way will be the distribution defined by (14.5.1).

The above considerations are immediately extended to any number of dimensions. Let ξ = (ξ₁, . . ., ξ_j) be a random variable in a j-dimensional space R_j, with the pr.f. P. Consider a k-dimensional vector function η = g(ξ) which is finite and uniquely defined for all ξ in R_j, and is itself represented by a point in a k-dimensional space R_k. We assume that any component η_ν of η is a B-measurable function (cf 9.1) of the variables ξ₁, . . ., ξ_j. It then follows as in the one-dimensional case that η is a random variable in R_k, with a pr.f.
Q determined by the relation (14.5.1) where, now, Y denotes any given set in R_k, while X is the corresponding set of all ξ in R_j such that η = g(ξ) ∈ Y. For a set Y such that the corresponding set X is empty, we obtain, of course, Q(Y) = 0. — The condition that g(ξ) should be finite and uniquely defined for all ξ in R_j may obviously be replaced by the more general condition that the points ξ where g(ξ) is not finite or not uniquely defined should form a set S such that P(S) = 0.

As an example, we may take η = (ξ₁, . . ., ξ_r), where r < j, so that g(ξ) is simply the projection of the point ξ on a certain subspace (cf 3.5) of r dimensions. The pr.f. of g(ξ) is then Q(Y) = P(X), where Y is a set in the subspace, while X is the cylinder set (cf 3.5) in R_j defined by the relation (ξ₁, . . ., ξ_r) ∈ Y. The corresponding distribution is the marginal distribution (cf 8.4) of (ξ₁, . . ., ξ_r), which is obtained by projecting the original distribution on the r-dimensional subspace. Taking, in particular, r = 1, it is seen that every component of the random variable ξ is itself a random variable, with a marginal distribution obtained by projecting the original distribution on the axis of that component.

A function η = g(ξ₁, . . ., ξ_n) of n random variables may be regarded as a function of the combined variable (ξ₁, . . ., ξ_n). Thus according to the above η is always a random variable, with a probability distribution uniquely determined by the simultaneous distribution of ξ₁, . . ., ξ_n. If ξ₁, . . ., ξ_n are independent variables, it is immediately seen that the variables g₁(ξ₁), . . ., g_n(ξ_n) are also independent.

14.6. Conclusion. — The contents of the present chapter may be briefly summed up in the following way. — From the domain of empirical data connected with random experiments, we have selected the fundamental fact of statistical regularity, viz. the long run stability of frequency ratios.
In our mathematical theory, we have idealized this fact by postulating the existence of conceptual counterparts of the frequency ratios: the mathematical probabilities. The process of idealization has then been carried one step further by our assumption that the additive property of the probabilities may be extended from a finite to an enumerable sequence of »events». In this way, we have reached the concept of a random variable and its probability distribution.

We have further introduced the assumption that any number of random experiments may be joined to form a combined random experiment, showing the same kind of statistical regularity as the component experiments. Thus we have obtained the idea of the joint probability distribution of a number of random variables.

The study of certain conditional frequencies has led us to introduce their conceptual counterparts, under the name of conditional probabilities. These are connected with a certain conditional distribution of a random variable, which in a particular case gives rise to the important concept of independent random variables.

Finally, it has been shown that a B-measurable function of any number of random variables is itself a random variable, with a probability distribution uniquely determined by the joint distribution of the arguments.

We have thus laid the foundations for a purely mathematical theory of random variables and probability distributions. Our next object will now be to work out this theory in detail, and the rest of Part II will be devoted to this purpose. In Chs 15–20 we shall mainly be concerned with variables and distributions in one dimension, while the multi-dimensional case will be dealt with in Chs 21–24. In Part III, we shall then turn to questions of testing the mathematical theory by experience, and using the results of the theory for purposes of statistical inference.

Chapters 15–20. Variables and Distributions in R₁.

CHAPTER 15.
General Properties.

15.1. Distribution function and frequency function. — Consider a one-dimensional random variable ξ. By Axioms 1 and 2 of 14.1, ξ possesses a definite probability distribution in R₁. This distribution may be concretely interpreted as the distribution of a unit of mass over R₁, in such a way that the mass quantity P(S) allotted to any Borel set S represents the probability that the variable ξ takes a value belonging to S.

As we have seen in 6.6, we are at liberty to define the distribution either by the non-negative and additive set function P(S), which is called the probability function (abbreviated pr.f.) of the variable ξ, or by the corresponding point function F(x) defined by the relation

F(x) = P(ξ ≤ x),

which is called the distribution function (abbreviated d.f.) of ξ. In the present case of a one-dimensional distribution, we shall practically always use F(x).

The reader is referred to the discussion of the general properties of a d.f. given in 6.6. In particular it has been shown there that any d.f. F(x) is a non-decreasing function of x, which is everywhere continuous to the right, and is such that F(−∞) = 0 and F(+∞) = 1. The difference F(b) − F(a) represents the probability that the variable ξ takes a value belonging to the interval a < ξ ≤ b:

P(a < ξ ≤ b) = F(b) − F(a).

If x₀ is a discontinuity point of F(x), with a saltus equal to p₀, it follows from 6.6 that the mass p₀ is concentrated in the point x₀, which means that we have the probability p₀ that the variable ξ takes the value x₀:

P(ξ = x₀) = p₀.

If, on the other hand, the derivative F′(x) = f(x) exists in a certain point x, then f(x) represents the density of mass at this point, and we shall call f(x) the probability density or the frequency function (abbreviated fr.f.) of the variable. The probability that the variable ξ takes a value belonging to the interval x < ξ ≤ x + Δx is then for small Δx asymptotically equal to f(x) Δx, which is written in the usual differential notation

P(x < ξ ≤ x + dx) = f(x) dx.
This differential will be called the probability element of the distribution.

Any function η = g(ξ) of the random variable ξ is, by 14.5, itself a random variable, with a d.f. given by (14.5.2). We shall consider two simple examples, that will often occur in the sequel.

In the case of a linear function η = aξ + b, the relation η ≤ y is equivalent to ξ ≤ (y − b)/a or to ξ ≥ (y − b)/a, according as a > 0 or a < 0. It then follows from (14.5.2) that η has the d.f.

(15.1.1) G(y) = F((y − b)/a) if a > 0,  G(y) = 1 − F((y − b)/a) if a < 0,

where F(x) denotes the d.f. of ξ. The formula for G(y) in the case a < 0 is, however, only valid if (y − b)/a is a continuity point of F. In a discontinuity point, the function should, according to our usual convention, be so determined as to be always continuous to the right. If the fr.f. f(x) = F′(x) exists for all values of x, it follows that η has the fr.f.

(15.1.2) g(y) = (1/|a|) f((y − b)/a).

Next, we consider the function η = ξ². The variable η is here always non-negative, and for y > 0 the relation η ≤ y is equivalent to −√y ≤ ξ ≤ √y. Consequently η has the d.f.

(15.1.3) G(y) = 0 for y < 0,  G(y) = F(√y) − F(−√y) for y ≥ 0.

This time, the last expression is valid only if −√y is a continuity point of F. If the fr.f. f(x) = F′(x) exists for all x, it follows that η has the fr.f.

(15.1.4) g(y) = G′(y) = 0 for y < 0,  g(y) = (f(√y) + f(−√y)) / (2√y) for y > 0.

Other simple functions may be treated in a similar way.

15.2. Two simple types of distributions. — In the majority of problems occurring in statistical applications, we are concerned with distributions belonging to one of the two simple types known as the discrete and the continuous type.

1. The discrete type. A random variable ξ will be said to be of the discrete type, or to possess a distribution of this type, if the total mass of the distribution is concentrated in discrete mass points¹) and if, moreover, any finite interval contains at most a finite number of the mass points. By 6.2, the set of all mass points is finite or enumerable.
Let us denote the mass points by x₁, x₂, . . ., and the corresponding masses by p₁, p₂, . . .. The distribution of ξ is then completely described by saying that, for every ν, we have the probability p_ν that ξ takes the value x_ν:

P(ξ = x_ν) = p_ν.

For a set S not containing any point x_ν we have, on the other hand, P(ξ ∈ S) = 0. Since the total mass in the distribution must be unity, we always have

Σ_ν p_ν = 1.

The d.f. F(x) is then given by

(15.2.1) F(x) = Σ_{x_ν ≤ x} p_ν,

the summation being extended to all values of ν such that x_ν ≤ x. Thus F(x) is a step-function (cf 6.2 and 6.6), which is constant over every interval not containing any point x_ν, but has in each x_ν a step of the height p_ν. A distribution of the discrete type may be graphically represented by means of a diagram of the function F(x), or by a diagram showing an ordinate of the height p_ν over each point x_ν, as illustrated by Figs 4 and 5.

Fig. 4. Distribution function of the discrete type. (Note that the median is indeterminate; cf p. 178.)

Fig. 5. Probabilities corresponding to the distribution in Fig. 4.

In statistical applications, variables of the discrete type occur e.g. in cases where the variable represents a certain number of units of some kind. Examples are: the number of pigs in a litter, the number of telephone calls at a given station during one hour, the number of business failures during one year. In such cases, the mass points are simply the natural numbers 0, 1, 2, . . ..

2. The continuous type. A variable ξ will be said to be of the continuous type, or to possess a distribution of this type, if the d.f. F(x) is everywhere continuous²) and if, moreover, the fr.f. f(x) = F′(x) exists and is continuous for all values of x, except possibly in certain points, of which any finite interval contains at most a finite number. The d.f. F(x) is then

F(x) = P(ξ ≤ x) = ∫_{−∞}^{x} f(t) dt.

¹) This corresponds to the case c₁ = 1, c₂ = 0 in (6.6.2).
²) This corresponds to the case c₁ = 0, c₂ = 1 in (6.6.2).
The distribution has no discrete mass points, and consequently the probability that ξ takes a particular value x₀ is zero for every x₀:

P(ξ = x₀) = 0.

The probability that ξ takes a value belonging to the finite or infinite interval (a, b) has thus the same value, whether we consider the interval as closed, open or half-open, and is given by

P(a < ξ ≤ b) = F(b) − F(a) = ∫_a^b f(t) dt.

Since the total mass in the distribution must be unity, we always have

∫_{−∞}^{∞} f(x) dx = 1.

A distribution of the continuous type may be graphically represented by diagrams showing the d. f. F(x) or the fr. f. f(x), as illustrated by Figs 6—7. The curve y = f(x) is known as the frequency curve of the distribution.

In statistical applications, variables of the continuous type occur when we are concerned with the measurement of quantities which, within certain limits, may assume any value. Examples are: the price of a commodity, the stature of a man, the yield of a corn field. In such cases the variables are treated as continuous, although strictly speaking the actual data are practically always discontinuous, since every measurement is expressed as an integral multiple of the smallest unit registered in our observations. Thus prices are expressed in money units, lengths may be expressed in cm and weights in kg, etc. When, for theoretical purposes, variables of this kind are considered as continuous, a certain mathematical idealization of actually observed facts is thus already implied.

15.3. Mean values. — Consider a random variable ξ with the d. f. F(x), and let g(ξ) be a function integrable over (−∞, ∞) with respect to F (cf 7.2). The integral

∫_{−∞}^{∞} g(x) dF(x)

has, in 7.4, been interpreted as a weighted mean of the values of g(x) for all values of x, the weights being furnished by the mass quantities dF situated in the neighbourhood of each point x.
Accordingly we shall denote this integral as the mean value or mathematical expectation of the random variable g(ξ), and write

(15.3.1)    E(g(ξ)) = ∫_{−∞}^{∞} g(x) dF(x).

Fig. 6. Distribution function of the continuous type. (Note that the distribution has a unique median at x₀; cf p. 178.)

Fig. 7. Frequency function of the distribution in Fig. 6. The shaded area corresponds to the probability P(a < ξ ≤ b). The distribution has a unique mode (cf p. 179) at c. The skewness (cf p. 184) is positive.

More generally, if ξ is a k-dimensional random variable with the probability function P(S), and if g(ξ) is a one-dimensional function (particular case k = 1 of 14.5) of ξ which is integrable over R_k with respect to P(S), we define the mean value of g(ξ) by the relation

(15.3.2)    E(g(ξ)) = ∫_{R_k} g(x) dP(S).

For a complex-valued function g(ξ) = a(ξ) + i b(ξ), we use the same formula to define the mean value, and thus obtain

E(g(ξ)) = E(a(ξ)) + i E(b(ξ)).

When there is no risk of a misunderstanding, we shall write simply E g(ξ) or E(g) instead of E(g(ξ)).

In the case of a one-dimensional distribution of the discrete type, as defined in the preceding paragraph, the mean value reduces according to (7.1.8) to a finite or infinite sum:

E(g(ξ)) = Σ_ν g(x_ν) p_ν,

while for the continuous type, assuming g(x) to be continuous except at most in a finite number of points, we obtain by (7.5.5) an ordinary Riemann integral:

E(g(ξ)) = ∫_{−∞}^{∞} g(x) f(x) dx.

The condition that g should be integrable over (−∞, ∞) with respect to F is, in the last two particular cases, equivalent to the absolute convergence of the series or integral representing the mean value. Thus it is only subject to this condition that the mean value exists. The condition is always satisfied in the particular case of a bounded function g(ξ), as pointed out in 7.4.

Consider now two variables ξ and η, defined in the spaces R′ and R″ of any number of dimensions, with the pr. f:s P₁ and P₂ respectively.
Let g(ξ) and h(η) be two real or complex functions such that the mean values E g(ξ) and E h(η) both exist. We shall consider the sum g(ξ) + h(η). By 14.5, this sum is a random variable, which may be regarded as a function of the combined variable (ξ, η). If R denotes the space of the combined variable, while P is the corresponding pr. f., the mean value of the sum has the expression

E(g(ξ) + h(η)) = ∫_R (g(x) + h(y)) dP = ∫_R g(x) dP + ∫_R h(y) dP.

By (9.2.2) the last two integrals reduce, however, to

∫_{R′} g(x) dP₁ = E g(ξ)    and    ∫_{R″} h(y) dP₂ = E h(η)

respectively, so that we obtain

(15.3.3)    E(g(ξ) + h(η)) = E g(ξ) + E h(η).

The extension of this relation to an arbitrary finite number of terms is immediate, and we thus have the following important theorem:

The mean value of a sum of random variables is equal to the sum of the mean values of the terms, provided that the latter mean values exist.

It should be observed that this theorem has been proved without any assumption concerning the nature of the dependence between the terms of the sum. In the case of the mean value of a product, it is not possible to obtain an equally general result. Using the same notations as above, we have

E(g(ξ) h(η)) = ∫_R g(x) h(y) dP.

In order to reduce this integral to a simple form, we now suppose that ξ and η are independent, so that the pr. f. P satisfies the multiplicative relation (14.4.4). By the final remark of 14.5, the variables g(ξ) and h(η) are then also independent. On this hypothesis, the formula for the mean value reduces according to (9.3.1) to

(15.3.4)    E(g(ξ) h(η)) = ∫_{R′} g(x) dP₁ · ∫_{R″} h(y) dP₂ = E g(ξ) E h(η).

The extension to an arbitrary finite number of factors is immediate, so that we have the following theorem:

The mean value of a product of independent random variables is equal to the product of the mean values of the factors, provided that the latter mean values exist.
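The two theorems just stated can be illustrated numerically. The sketch below is our own, not the author's: the addition rule holds exactly even for strongly dependent terms, while the product rule fails without independence.

```python
import random

# Our illustration: E(xi + eta) = E(xi) + E(eta) needs no independence,
# while E(xi * eta) = E(xi) E(eta) does.
random.seed(2)
n = 100_000
xi  = [random.gauss(0.0, 1.0) for _ in range(n)]
eta = [x * x for x in xi]          # eta is a function of xi: strongly dependent

def mean(v):
    return sum(v) / len(v)

# Addition theorem, dependence notwithstanding.
lhs = mean([x + y for x, y in zip(xi, eta)])
assert abs(lhs - (mean(xi) + mean(eta))) < 1e-9

# Product rule fails for the dependent pair (xi, xi):
# E(xi * xi) ~ 1, while E(xi) E(xi) ~ 0.
prod_dep = mean([x * y for x, y in zip(xi, xi)])
assert abs(prod_dep - 1.0) < 0.05
assert abs(mean(xi) * mean(xi)) < 0.01

# For genuinely independent factors the product rule holds (up to sampling error).
zeta = [random.gauss(1.0, 1.0) for _ in range(n)]
assert abs(mean([x * z for x, z in zip(xi, zeta)]) - mean(xi) * mean(zeta)) < 0.05
```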
We finally consider some simple particular cases of the preceding general relations. — If ξ is a one-dimensional random variable, such that the mean value E(ξ) obtained by taking g(ξ) = ξ in (15.3.1) exists, we have for any constants a and b

(15.3.5)    E(aξ + b) = a E(ξ) + b.

Putting E(ξ) = m we have, in particular,

(15.3.6)    E(ξ − m) = m − m = 0.

Taking g(ξ) = ξ, h(η) = η in the addition theorem (15.3.3), we obtain

(15.3.7)    E(ξ + η) = E(ξ) + E(η).

If ξ and η are independent, the multiplication theorem (15.3.4) gives

(15.3.8)    E(ξη) = E(ξ) E(η).

15.4. Moments. — The moments of a one-dimensional distribution have been introduced in 7.4. If, for a positive integer ν, the function x^ν is integrable over (−∞, ∞) with respect to F(x), the mean value

(15.4.1)    α_ν = E(ξ^ν) = ∫_{−∞}^{∞} x^ν dF(x)

is called the moment of order ν, or simply the ν:th moment, of the variable or the distribution, and we say that the ν:th moment is finite or exists. Obviously α₀ always exists and is equal to unity. If α_ν exists, the function |x|^ν is also integrable, so that the ν:th absolute moment

(15.4.2)    β_ν = ∫_{−∞}^{∞} |x|^ν dF(x)

exists. It follows that, if α_k exists, then α_ν and β_ν exist for 0 ≤ ν ≤ k.

For a distribution of the discrete type, the moments are according to 15.3 expressed by the series

α_ν = Σ_i x_i^ν p_i,

and for a distribution of the continuous type by the Riemann integral

α_ν = ∫_{−∞}^{∞} x^ν f(x) dx.

It is only in the case when the series or integral representing the moment is absolutely convergent that the moment is said to exist.

The first moment α₁ is equal to the mean value, or briefly the mean, of the variable, and will often be denoted by the letter m:

α₁ = E(ξ) = m.

If c denotes any constant, the quantities

E[(ξ − c)^ν] = ∫_{−∞}^{∞} (x − c)^ν dF(x)

are called the moments about the point c. For c = 0, we obtain the ordinary moments. The absolute moments about c are, of course, defined in an analogous way. The moments about the mean m are often called the central moments.
These are particularly important and deserve a special notation. We shall write

(15.4.3)    μ_ν = E[(ξ − m)^ν] = ∫_{−∞}^{∞} (x − m)^ν dF(x).

Developing the factor (x − m)^ν, we find

            μ₀ = 1,    μ₁ = 0,
(15.4.4)    μ₂ = α₂ − m²,
            μ₃ = α₃ − 3 m α₂ + 2 m³,
            μ₄ = α₄ − 4 m α₃ + 6 m² α₂ − 3 m⁴.

For the second moment about any point c, we have

E[(ξ − c)²] = E[(ξ − m + m − c)²] = μ₂ + (m − c)²,

so that the second moment becomes a minimum when taken about the mean.

The moments of any function g(ξ) are the mean values of the successive powers of g(ξ). In the particular case of a linear function g(ξ) = aξ + b, the ν:th moment is given by the expression

α′_ν = E[(aξ + b)^ν] = a^ν α_ν + (ν over 1) a^{ν−1} b α_{ν−1} + · · · + b^ν.

In 7.4, we have given a simple sufficient condition for the existence of the moment of a given order k. We remark further that, when the variable ξ is bounded, i. e. when finite a and b can be found such that P(a < ξ < b) = 1, all moments are finite, and |α_ν| ≤ β_ν ≤ |a|^ν + |b|^ν.

We shall now prove an important inequality for the absolute moments defined by (15.4.2). The quadratic form in u and v

∫_{−∞}^{∞} (u |x|^{(ν−1)/2} + v |x|^{(ν+1)/2})² dF(x) = β_{ν−1} u² + 2 β_ν u v + β_{ν+1} v²

is evidently non-negative. Thus by 11.10 the determinant of the form is non-negative, so that we have β_{ν−1} β_{ν+1} − β_ν² ≥ 0, or

(15.4.5)    β_ν² ≤ β_{ν−1} β_{ν+1}.

Replacing here ν successively by 1, 2, . . ., ν, raising the inequalities thus formed to the powers 1, 2, . . ., ν respectively, and multiplying them together, we obtain β_ν^{ν+1} ≤ β_{ν+1}^ν, or finally

(15.4.6)    β_ν^{1/ν} ≤ β_{ν+1}^{1/(ν+1)}    (ν = 1, 2, . . .).

It is often important to know whether a distribution is uniquely determined by the sequence of its moments. We shall not enter upon a complete discussion of this difficult problem, but shall content ourselves with proving the following criterion that is often useful.

Let α₀ = 1, α₁, α₂, . . . be the moments of a certain d. f. F(x), all of which are assumed to be finite. Suppose that the series Σ_{ν=0}^{∞} α_ν r^ν/ν! is absolutely convergent for some r > 0. Then F(x) is the only d. f. that has the moments α₁, α₂, . . ..

We shall first show that β_n r^n/n! → 0 as n → ∞.
If n is restricted to even values, this follows directly from our hypothesis, and for odd values of n we have by (15.4.5)

(β_n r^n/n!)² ≤ (β_{n−1} r^{n−1}/(n−1)!) · (β_{n+1} r^{n+1}/(n+1)!) · (n + 1)/n,

which completes the proof of our assertion. — For any integer n > 0 and for any real z we have the MacLaurin expansion

e^{iz} = Σ_{ν=0}^{n−1} (iz)^ν/ν! + ϑ |z|^n/n!,

where ϑ denotes a real or complex quantity of modulus not exceeding unity. Hence we obtain by means of (10.1.2) the following expansion for the c. f. φ(t) of F(x):

φ(t + h) = ∫_{−∞}^{∞} e^{i(t+h)x} dF(x)
         = Σ_{ν=0}^{n−1} (h^ν/ν!) ∫_{−∞}^{∞} (ix)^ν e^{itx} dF(x) + ϑ (|h|^n/n!) ∫_{−∞}^{∞} |x|^n dF(x)
         = Σ_{ν=0}^{n−1} φ^{(ν)}(t) h^ν/ν! + ϑ β_n |h|^n/n!.

For |h| < r the remainder tends to zero, so that for any t the c. f. φ(t + h) can be developed in Taylor's series, convergent for |h| < r. Taking first t = 0, we find that the series (where we have written t in the place of h)

(15.4.7)    φ(t) = Σ_{ν=0}^{∞} α_ν (it)^ν/ν!

represents the function φ(t) at least in the interval −r < t < r. In this interval, φ(t) is thus uniquely determined by the moments α_ν. In the points t = ± ½r, the series obtained by differentiating (15.4.7) any number of times is convergent, so that all the derivatives φ^{(ν)}(± ½r) can be calculated from (15.4.7), i. e. from the moments α_ν. These derivatives appear as coefficients in the Taylor developments of φ(± ½r + h), which converge and represent φ(t) for |h| < r, so that the domain where φ(t) is known is now extended to the interval −(3/2)r < t < (3/2)r. From the last developments, we can now calculate the derivatives in the points t = ± r, and use these as coefficients in the Taylor developments of φ(± r + h), etc. In this way we may go on as long as we please, and it will be seen that by this procedure the c. f. φ(t) is uniquely defined by the moments α_ν for all values of t.¹) It then follows from the uniqueness theorem (10.3.1) that the d. f. F(x) is also uniquely determined by the α_ν, and our theorem is proved.

In the particular case when F(x) is the d. f. of a bounded variable, it follows from the remark made above that the conditions of the theorem are always satisfied.
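The moment relations of this paragraph are easy to verify exactly for a discrete distribution. The following sketch is our own (the three-point distribution is an arbitrary choice); it checks (15.4.4) and the inequality (15.4.5) in rational arithmetic.

```python
from fractions import Fraction as Fr

# Our exact check of (15.4.4) and (15.4.5) for an arbitrary three-point
# distribution with masses 1/2, 1/3, 1/6 at the points 0, 1, 3.
xs = [Fr(0), Fr(1), Fr(3)]
ps = [Fr(1, 2), Fr(1, 3), Fr(1, 6)]

def alpha(v):
    """Moment alpha_v = E(xi^v)."""
    return sum(x ** v * p for x, p in zip(xs, ps))

m = alpha(1)

def mu(v):
    """Central moment mu_v = E[(xi - m)^v], computed directly."""
    return sum((x - m) ** v * p for x, p in zip(xs, ps))

# The relations (15.4.4), exact in rational arithmetic.
assert mu(2) == alpha(2) - m ** 2
assert mu(3) == alpha(3) - 3 * m * alpha(2) + 2 * m ** 3
assert mu(4) == alpha(4) - 4 * m * alpha(3) + 6 * m ** 2 * alpha(2) - 3 * m ** 4

# The inequality (15.4.5); here xi >= 0, so beta_v = alpha_v.
for v in range(1, 6):
    assert alpha(v) ** 2 <= alpha(v - 1) * alpha(v + 1)
```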
¹) This is the method known as analytic continuation in the Theory of Analytic Functions.

15.5. Measures of location. — In practical applications it is important to be able to describe the main features of a distribution by means of a few simple parameters. In the first place, we often want to locate a distribution by finding some typical value of the variable, which may be conceived as a central point of the distribution. There are various ways of calculating such a typical parameter, and we shall here discuss the three most important cases, viz. the mean, the median, and the mode.

The mean E(ξ) = m is the first moment of the distribution, and has already been defined in the preceding paragraph. In terms of our mechanical interpretation of the probability distribution as a distribution of mass, the mean has an important concrete significance: it is the abscissa of the centre of gravity of the distribution (cf 7.4). This property gives the mean an evident claim of being regarded as a typical parameter.

The median. — If μ is a point which divides the whole mass of the distribution into two equal parts, each containing the mass ½, then μ is called a median of the distribution. Thus any root of the equation F(x) = ½ is a median of the distribution. In order to discuss the possible cases, we consider the curve y = F(x), regarding any vertical step as part of the curve, so that we have a single connected, never decreasing curve (cf Figs 4 and 6). This curve has at least one point of intersection with the straight line y = ½. If there is only one point of intersection, the abscissa of this point is the unique median of the distribution (cf Fig. 6). It may, however, occur that the curve and the line have a whole closed interval in common (cf Fig. 4). In this case the abscissa of every point in the interval satisfies the equation F(x) = ½, and may thus be taken as a median of the distribution. We thus see that every distribution has at least one median.
In the determinate case, the median is uniquely defined; in the indeterminate case, every point in a certain closed interval is a median. The mean, on the other hand, does not always exist. Even in cases when the mean does exist, the median is sometimes preferable as a typical parameter, since the value of the mean may be largely influenced by the occurrence of very small masses situated at a very large distance from the bulk of the distribution.

As shown in the preceding paragraph, the mean is characterized by a certain minimum property: the second moment becomes a minimum when taken about the mean. There is an analogous property of the median: the first absolute moment E(|ξ − c|) becomes a minimum when c is equal to the median. This property holds even in the indeterminate case, and the moment has then the same value for c equal to any of the possible median values. Denoting the median (or, in the indeterminate case, any median value) by μ, we have in fact the relations

E(|ξ − c|) = E(|ξ − μ|) + 2 ∫_μ^c (c − x) dF(x)    for c > μ,
E(|ξ − c|) = E(|ξ − μ|) + 2 ∫_c^μ (x − c) dF(x)    for c < μ.

The second terms on the right hand sides are evidently positive, except in the case when c is another median value (indeterminate case), when the corresponding term is zero.¹) The proof of these relations will be left as an exercise for the reader.

The mode of a distribution will only be defined for distributions of the two simple types introduced in 15.2. For a distribution of the continuous type, any maximum point x₀ of the frequency function f(x) is called a mode of the distribution. A unique mode thus only exists for frequency curves y = f(x) having a single maximum (cf Fig. 7); such unimodal distributions occur, however, often in statistical applications. When the frequency curve has more than one maximum, the distribution is called bimodal or multimodal, as the case may be. — For a distribution of the discrete type, we may suppose the mass points x_ν arranged in increasing order of magnitude.
The point x_ν is then called a mode of the distribution, if p_ν > p_{ν−1} and p_ν > p_{ν+1}. The expressions unimodal, bimodal and multimodal distributions are here defined in a similar way as for continuous distributions.

In the particular case when the distribution is symmetric about a certain point a, we have F(a + x) + F(a − x) = 1 as soon as a ± x are continuity points of F. It is then seen that the mean (if existent) and the median are both equal to a. If, in addition, the distribution is unimodal, the mode is also equal to a.

¹) In the particular case when μ is a discontinuity point of F, the ordinary definition of the integrals in the second members must be somewhat modified, as the integrals should then in both cases include half the contribution arising from the discontinuity.

15.6. Measures of dispersion. — When we know a typical value for a random variable, it is often required to calculate some parameter giving an idea of how widely the values of the variable are spread on either side of the typical value. A parameter of this kind is called a measure of spread or dispersion. It is sometimes also called a measure of concentration. Dispersion and concentration vary, of course, in inverse sense; the greater the dispersion, the smaller the concentration, and conversely.

If our typical value is the mean m of the distribution, it seems natural to consider the second moment μ₂ about the mean as a dispersion measure. This is called the variance of the variable, and represents the moment of inertia of the mass distribution with respect to a perpendicular axis through the centre of gravity (cf 7.4). We have, of course, always μ₂ ≥ 0. When μ₂ = 0, it follows from the definition of μ₂ that the whole mass of the distribution must be concentrated in the single point m (cf 16.1).
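The two minimum properties noted above — the second moment is least about the mean (15.4), the first absolute moment least about a median (15.5) — can be illustrated numerically. A sketch of our own, with an arbitrary four-point distribution whose median is indeterminate:

```python
# Our illustration: E[(xi - c)^2] is least for c = mean; E|xi - c| is least
# for c equal to any median value.  Masses 0.2, 0.3, 0.3, 0.2 at 0, 1, 2, 10.
xs = [0.0, 1.0, 2.0, 10.0]
ps = [0.2, 0.3, 0.3, 0.2]

mean = sum(x * p for x, p in zip(xs, ps))            # = 2.9
# Cumulative masses 0.2, 0.5, 0.8: F(x) = 1/2 on [1, 2), so every c in [1, 2]
# is a median (indeterminate case).

def second_moment_about(c):
    return sum((x - c) ** 2 * p for x, p in zip(xs, ps))

def mean_abs_dev_about(c):
    return sum(abs(x - c) * p for x, p in zip(xs, ps))

grid = [i / 100 for i in range(0, 1001)]             # c from 0 to 10
best_sq  = min(grid, key=second_moment_about)
best_abs = min(grid, key=mean_abs_dev_about)

assert abs(best_sq - mean) < 0.011                   # minimum at the mean
assert 1.0 <= best_abs <= 2.0                        # minimum at a median value
```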
In order to obtain a quantity of the first dimension in units of the variable, it is, however, often preferable to use the non-negative square root of μ₂, which is called the standard deviation (abbreviated s. d.) of the variable, and is denoted by D(ξ) or sometimes by the single letter σ. We then have for any variable such that the second moment exists

σ² = D²(ξ) = E[(ξ − m)²] = E(ξ²) − E²(ξ).

It then follows from (15.3.5) that we have for any constants a and b

D(aξ + b) = |a| D(ξ).

When ξ is a variable with the mean m and the s. d. σ, we shall often have occasion to consider the corresponding standardized variable

(ξ − m)/σ,

which represents the deviation of ξ from its mean m, expressed in units of the s. d. σ. It follows from the last relation and from (15.3.5) that the standardized variable has zero mean and unit s. d.:

E((ξ − m)/σ) = 0,    D((ξ − m)/σ) = 1.

If ξ and η are independent variables, it further follows from (15.3.8) that we have

(15.6.1)    D²(ξ + η) = D²(ξ) + D²(η).

This relation is immediately extended to any finite number of terms. If ξ₁, . . ., ξ_n are independent variables, we thus obtain

(15.6.2)    D²(ξ₁ + · · · + ξ_n) = D²(ξ₁) + · · · + D²(ξ_n).

We have seen that the second moment is a minimum when taken about the mean, and the first absolute moment when taken about the median (cf 15.4 and 15.5). If we use the median as our typical value, it thus seems natural to use the first absolute moment, taken about the median, as measure of dispersion. This is called the mean deviation of the variable. Sometimes the name of mean deviation is used for the first absolute moment taken about the mean, but this practice is not to be recommended.

In the same way as we have defined the median by means of the equation F(x) = ½, we may define a quantity ζ_p by the equation F(ζ_p) = p, where p is any given number such that 0 < p < 1. The quantity ζ_p will be called the quantile of order p of the distribution. Like the median, any quantile may sometimes be indeterminate. The quantile ζ_{1/2} is, of course, identical with the median.
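The properties of the standardized variable and the addition rule (15.6.2) lend themselves to a quick empirical check. The following sketch is ours, using simulated exponential and normal samples:

```python
import random

# Our sketch: (xi - m)/sigma has zero mean and unit s.d., and for independent
# terms the variances add as in (15.6.2).
random.seed(3)
n = 200_000
xi  = [random.expovariate(0.5) for _ in range(n)]   # mean 2, variance 4
eta = [random.gauss(1.0, 3.0) for _ in range(n)]    # variance 9, independent of xi

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

m, s = mean(xi), var(xi) ** 0.5
std = [(x - m) / s for x in xi]
assert abs(mean(std)) < 1e-6                        # zero mean
assert abs(var(std) - 1.0) < 1e-6                   # unit s.d.

# D^2(xi + eta) = D^2(xi) + D^2(eta), up to sampling error.
total = [x + y for x, y in zip(xi, eta)]
assert abs(var(total) - (var(xi) + var(eta))) < 0.2
```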
The knowledge of ζ_p for some set of conveniently chosen values of p, such as p = ¼, ½, ¾, or p = 0.1, 0.2, . . ., 0.9, will obviously give a good idea of the location and dispersion of the distribution. The quantities ζ_{1/4} and ζ_{3/4} are called the lower and upper quartiles, while the quantities ζ_{0.1}, ζ_{0.2}, . . . are known as the deciles. The halved difference ½(ζ_{3/4} − ζ_{1/4}) is sometimes used as a measure of dispersion under the name of semi-interquartile range.

If the whole mass of the distribution is situated within finite distance, there is an upper bound g of all points x such that F(x) = 0, and a lower bound G of all x such that F(x) = 1. The interval (g, G) then contains the whole mass of the distribution. The length G − g of this interval is called the range of the distribution, and may be used as a measure of dispersion. The word range is sometimes also used to denote the interval (g, G) itself. If we know this interval, we have a fairly good idea both of the location and of the dispersion of the distribution. For a distribution where the range is not finite, intervals such as (m − σ, m + σ) or (ζ_{1/4}, ζ_{3/4}), although they do not contain the whole mass of the distribution, may be used in a similar way, as a kind of geometrical representation of the location and dispersion of the distribution (cf 21.10).

All measures of location and dispersion, and of other similar properties, are to a large extent arbitrary. This is quite natural, since the properties to be described by such parameters are too vaguely defined to admit of unique measurement by means of a single number.
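As a worked illustration (ours, not from the text) of these dispersion measures, consider the exponential d. f. F(x) = 1 − e^{−x} for x > 0, whose quantiles are available in closed form, so that the median, the semi-interquartile range, and the mean deviation can all be written down exactly:

```python
import math

# Our sketch: dispersion measures for F(x) = 1 - exp(-x), x > 0 (mean 1, sigma 1).
# The quantile of order p solves F(zeta_p) = p, hence zeta_p = -log(1 - p).
def quantile(p):
    return -math.log1p(-p)

median  = quantile(0.5)                              # log 2 ~ 0.693
semi_iq = (quantile(0.75) - quantile(0.25)) / 2      # half the quartile distance

# Mean deviation about the median: a short exact computation gives
# E|xi - med| = med - 1 + 2 exp(-med), which equals log 2 here.
mean_dev = median - 1.0 + 2.0 * math.exp(-median)

assert abs(median - math.log(2.0)) < 1e-12
assert abs(semi_iq - (math.log(4.0) - math.log(4.0 / 3.0)) / 2) < 1e-12
assert abs(mean_dev - math.log(2.0)) < 1e-9
assert mean_dev < 1.0        # here the mean deviation is smaller than the s.d.
```

Different measures thus assign quite different numbers to the same distribution, which is one concrete sense in which the choice among them is conventional.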
— OD The concentration of the variable § about the point w = 0 will be measured by the same quantity: the smaller the mean square, the greater the concentration, and conversely. Thus the mean square of a variable quantity is considered as a measure of the deviation of this quantity from zero. This is a way of expressing the famous principle of least squares^ that we shall meet in various connections in the sequel. — It follows from the above that there is no logical necessity prompting us to adopt this principle. On the contrary, it is largely a matter of convention whether we choose to do so or not. The main reason in favour of the principle lies in the relatively simple nature of the rules of operation to which it leads. We have, e. g., the simple addition rule (15.G.2) for the variance, while there is no analogue for the other dispersion measures discussed above. 15.7. Tchebycheff’s theorem. — We shall now prove the following generalization of a theorem due to Tchebycheff: Let r/(J) be a iwn negative function of the random variable Lhr every K > 0 tve then have (15.7.1) where P denotes as usual the pr.f. of If we denote by S the set of all ^ satisfying the inequality g (^) S; K, the truth of the theorem follows directly from the relation j 9 (x) dF S A"/ dF = KP[S). — 00 ,S It is evident that the theorem holds, with the same proof, even when ^ is replaced by a random variable § in any number of dimensions. Taking in particular g (f) == (f — w)®, K = a®, where m and a 182 15.7-8 denote the mean and the s. d. of we obtain for every A; > 0 the Bienayme^ Tchebycheff inequality : (15.7.2) This inequality shows that the quantity of mass in the distribution situated outside the interval m — A:o'<^<w + A:ffi8 at most equal to and thus gives a good idea of the sense in which a may be used as a measure of dispersion or concentration. For the particular distribution of mean m and s. d. 
σ which has a mass 1/(2k²) in each of the points x = m ± kσ, and a mass 1 − 1/k² in the point x = m, we have P(|ξ − m| ≥ kσ) = 1/k², and it is thus seen that the upper limit of the probability given by (15.7.2) cannot generally be improved.

On the other hand, if we restrict ourselves to certain classes of distributions, it is sometimes possible to improve the inequality (15.7.2). Thus it was already shown by Gauss in 1821 that for a unimodal distribution (cf 15.5) of the continuous type we have for every k > 0

(15.7.3)    P(|ξ − x₀| ≥ kτ) ≤ 4/(9k²),

where x₀ is the mode, and τ² = σ² + (x₀ − m)² is the second order moment about the mode. A simple proof of this relation will be indicated in Ex. 4 on p. 266. Hence we obtain the following inequality for the deviation from the mean:

(15.7.4)    P(|ξ − m| ≥ kσ) ≤ 4(1 + s²)/(9(k − |s|)²)

for every k > |s|, where s denotes the Pearson measure of skewness defined in 15.8. For moderate values of |s|, this inequality often gives a lower value to the limit than (15.7.2). Thus if |s| < 0.25, the probability of a deviation exceeding 3σ is by (15.7.4) smaller than 0.0624, while (15.7.2) gives the less precise limit 0.1111. For the probability of a deviation exceeding 4σ, the corresponding figures are 0.0336 by (15.7.4), and 0.0625 by (15.7.2).

15.8. Measures of skewness and excess. — In a symmetric distribution, every moment of odd order about the mean (if existent) is evidently equal to zero. Any such moment which is not zero may thus be considered as a measure of the asymmetry or skewness of the distribution. The simplest of these measures is μ₃, which is of the third dimension in units of the variable. In order to reduce this to zero dimension, and so construct an absolute measure, we divide by σ³ and regard the ratio

(15.8.1)    γ₁ = μ₃/σ³

as a measure of the skewness. We shall call γ₁ the coefficient of skewness. In statistical applications, we often meet unimodal continuous distributions of the type shown in Fig.
7, where the frequency curve forms a »long tail» on one side of the mode, and a »short tail» on the other side. In the curve shown in Fig. 7, the long tail is on the positive side, and the cubes of the positive deviations will then generally outweigh the negative cubes, so that γ₁ will be positive. We shall call this a distribution of positive skewness. Similarly we have negative skewness when γ₁ is negative; the long tail will then generally be on the negative side.

Reducing the fourth moment to zero dimension in the same way as above, we define the coefficient of excess

(15.8.2)    γ₂ = μ₄/σ⁴ − 3,

which is sometimes used as a measure of the degree of flattening of a frequency curve near its centre. For the important normal distribution (cf 17.2), γ₂ is equal to zero. Positive values of γ₂ are supposed to indicate that the frequency curve is more tall and slim than the normal curve in the neighbourhood of the mode, and conversely for negative values. In the former case, it is usual to talk of a positive excess, as compared with the normal curve, in the latter case of a negative excess. This usage is, however, open to certain criticism (cf 17.6). In the literature, the quantities β₁ = γ₁² and β₂ = γ₂ + 3 are often used instead of γ₁ and γ₂.

Many other measures of skewness and excess have been proposed. Thus K. Pearson introduced the difference between the mean and the mode, divided by the s. d.:

S = (m − x₀)/σ,

as a measure of skewness. For the class of distributions belonging to the Pearson system (cf 19.4), it can be shown that

S = γ₁(γ₂ + 6)/(2(5γ₂ − 6γ₁² + 6)).

When γ₁ and γ₂ are small, this gives approximately S ≈ ½γ₁, or x₀ ≈ m − ½γ₁σ. The last relation also holds approximately for distributions given by the Edgeworth or Charlier expansions (cf 17.6—17.7). Charlier used as measures of skewness and excess certain coefficients proportional to γ₁ and γ₂ respectively.

15.9. Characteristic functions. — The mean value of the particular function e^{itξ} will be written

(15.9.1)    φ(t) = E(e^{itξ}) = ∫_{−∞}^{∞} e^{itx} dF(x).
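For a discrete distribution, the integral in (15.9.1) reduces to a finite sum, and the elementary properties φ(0) = 1 and |φ(t)| ≤ 1 (both immediate from the definition, since |e^{itx}| = 1) can be verified directly. The sketch below is our own illustration, for a two-point distribution of our choosing:

```python
import cmath

# Our sketch: the c.f. of the two-point distribution with mass 1/2 at 0 and
# 1/2 at 1, computed straight from (15.9.1): phi(t) = (1 + e^{it})/2.
xs = [0.0, 1.0]
ps = [0.5, 0.5]

def phi(t):
    """Characteristic function E(e^{it xi}) of the discrete distribution."""
    return sum(p * cmath.exp(1j * t * x) for x, p in zip(xs, ps))

assert abs(phi(0.0) - 1.0) < 1e-12          # phi(0) = total mass = 1
for t in (-3.0, -1.0, 0.5, 2.0):
    assert abs(phi(t)) <= 1.0 + 1e-12       # |phi(t)| <= 1 for all t
# Agreement with the closed form (1 + e^{it})/2:
assert abs(phi(2.0) - (1 + cmath.exp(2j)) / 2) < 1e-12
```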
This is a function of the real variable t, and will be called the characteristic function (abbreviated c. f.) of the variable ξ or of the corresponding distribution. The reader is referred to the discussion of the mathematical theory of characteristic functions given in Ch. 10.

It follows in particular from this discussion that there is a one-to-one correspondence between distributions and characteristic functions. If two distributions are identical, so are their c. f:s, and conversely. This property has important consequences. In many problems where it is required to find the distribution of some given random variable, it is relatively easy to find the c. f. of the variable. If this is found to agree with the c. f. of some already known distribution, we may conclude that the latter must be identical with the required distribution.

The c. f. of any function g(ξ) is the mean value of e^{itg(ξ)}. In the particular case of a linear function g(ξ) = aξ + b, the c. f. becomes

(15.9.2)    E(e^{it(aξ+b)}) = e^{ibt} φ(at).

Thus e. g. the variable −ξ has the c. f. φ(−t), the complex conjugate of φ(t). Further, the standardized variable (ξ − m)/σ has the c. f. e^{−imt/σ} φ(t/σ).

15.10. Semi-invariants. — If the k:th moment of the distribution exists, the c. f. may according to (10.1.3) be developed in MacLaurin's series for small values of t:

(15.10.1)    φ(t) = 1 + Σ_{ν=1}^{k} α_ν (it)^ν/ν! + o(t^k).

For the function log(1 + z) we have the corresponding development

log(1 + z) = z − z²/2 + · · · ± z^n/n + O(z^{n+1}).
It is seen that Xn is a polynomial in Cj, . . and conversely Un is a polynomial in Xj, . . Xu. In particular we have X, = «! = w, x^ — a^j — a? = 0-, (15.10.3) Xg = Ug — 3 a, or^ H“ 2 aj, Xj = Gr^ — 3 02 — 4 ffg + 1 2 a? Og — G at, and conversely o, = Xi, Oo = x^ 4- X?, (15.10.4) Og = Xg + 3 X, Xg + X?, 04 = X4 4- 3 x-5 + 4 x^ Xg 4- G X? Xj 4- x}, In terms of the central moments fiv, the expressions of the x» become 186 15 . 10--11 X, = m, 3^8 = (15.10.5) >^4 = ■“ 3/ia, = 10^2iU3, 3^6 = ^6 — 15 /“a i“4 “ ^0 /uS + 30 /U2, so that the coefficients of skewness and excess introduced in 15.8 are ?. = ,;» Mi The semi-invariants xt of a linear function ^ (f) = $ + 6 are, by (15.9.2), found from the development log (a <)] = 2 ^ ° ■ Comparing with (15.10.2), we obtain the expressions xi — X| -t- 6, and xt — x^ for r > 1. 15 . 11 . Independent variables. Let ^ and be random variables with the d. f:s and and the joint pr. f . P. By (14.4.5) a ne- cessary and sufficient condition for the independence of ^ and y] is that the joint d. f. of the variables is, for all x and y, given by the expression^) (15.11.1) F{x,y) P(^^x,r}^ y) =- (x) F* (y). When both variables have distributions belonging to the same simple type, the independence condition may be expressed in a more con- venient form, as we are now going to show. Consider first the case of two variables of the discrete type, with distributions given by P{^ = Xy) -■ Py, P(r]= y,) == Qy, where 1, 2, . . . It is then easily seen that the independence con- dition (15.11.1) is equivalent to (15.11.2) P{^ = x,„ r]=-yy) = p„qy / for all values of fi and r. ’) Another necessary and sufficient condition will be given in 21.3. 187 15.11-12 In the case of two variables of the continuous type, the independ- ence condition (15.11.1) may be differentiated with respect to 2 ; and and we obtain (15.1 1.3) fix, y) = == fMfM, where /i and are the fr. f:s of ^ and tj, while / is according^ to 8.4 the fr. f . 
of the joint distribution, or the joint fr. f. of ξ and η. Conversely, from (15.11.3) we obtain (15.11.1) by direct integration. Thus a necessary and sufficient condition for independence is given by (15.11.2) in the case of two discrete variables, and by (15.11.3) in the case of two continuous variables. Both conditions immediately extend themselves to an arbitrary finite number of variables.

15.12. Addition of independent variables. — Let ξ and η be independent random variables with known distributions. By 14.5, the sum ξ + η has a distribution uniquely determined by the distributions of ξ and η. In many problems it is required to express the d. f., the c. f., the moments etc. of this distribution in terms of the corresponding functions and quantities of the given distributions of ξ and η. The problem may, of course, be generalized to a sum of more than two independent variables.

We shall first consider the c. f:s. Let φ₁(t), φ₂(t) and φ(t) denote the c. f:s of ξ, η and ξ + η respectively. We then have, by the theorem (15.3.4) on the mean value of a product of independent factors,

φ(t) = E(e^{it(ξ+η)}) = E(e^{itξ} e^{itη}) = E(e^{itξ}) E(e^{itη}) = φ₁(t) φ₂(t).

This relation is immediately extended to an arbitrary finite number of variables. If ξ₁, . . ., ξ_n are independent variables with the c. f:s φ₁(t), . . ., φ_n(t), the c. f. φ(t) of the sum ξ₁ + · · · + ξ_n is thus given by the relation

(15.12.1)    φ(t) = φ₁(t) φ₂(t) . . . φ_n(t),

so that we have the following important theorem, which expresses a fundamental property of the c. f:s.

The characteristic function of a sum of independent variables is equal to the product of the characteristic functions of the terms.

We now want to express the d. f. of the sum ξ + η by means of the d. f:s F₁ and F₂ of the terms. This problem will be treated as an
Consider the integral

F(x) = ∫_{−∞}^{∞} F₁(x − z) dF₂(z).

Since F₁ is bounded, this integral has by 7.1 a finite and determined value for every x. Now F₁(x − z) is, for every fixed z, a never decreasing function of x which is everywhere continuous to the right, and tends to 1 as x → +∞, and to 0 as x → −∞. Consider the difference F(x + h) − F(x), where h > 0. It follows from (7.1.4) that this difference is non-negative, and from (7.3.1) that it tends to zero with h. It further follows from (7.3.1) that F(x) tends to 1 as x → +∞, and to 0 as x → −∞. Thus F(x) is a d. f. The corresponding c. f.

∫_{−∞}^{∞} e^{itx} dF(x)

is, by (7.5.6), the limit as n → ∞ of a sum S_n of the form

S_n = Σ_ν e^{itx_ν} [F(x_ν) − F(x_{ν−1})],

provided that the maximum length of the sub-intervals (x_{ν−1}, x_ν) tends to zero, while x₀ → −∞ and x_n → +∞. Introducing here the integral expression of F(x), we obtain

S_n = ∫_{−∞}^{∞} s_n e^{itz} dF₂(z),

where

s_n = Σ_ν e^{it(x_ν − z)} [F₁(x_ν − z) − F₁(x_{ν−1} − z)].

As n → ∞, s_n tends for every fixed z to the limit

lim s_n = ∫_{−∞}^{∞} e^{itx} dF₁(x) = φ₁(t).

Further, s_n is uniformly bounded, since we have |s_n| ≤ Σ_ν [F₁(x_ν − z) − F₁(x_{ν−1} − z)] ≤ 1. According to (7.1.7) it then follows that

lim S_n = φ₁(t) ∫_{−∞}^{∞} e^{itz} dF₂(z) = φ₁(t) φ₂(t).

Thus the c. f. of F(x) is identical with the c. f. φ(t) = φ₁(t)φ₂(t) of the sum ξ + η, so that F(x) is the required d. f. Since the functions F₁ and F₂ may evidently be interchanged without affecting the proof, we have established the following theorem:

The distribution function F(x) of the sum of two independent variables is given by the expression

(15.12.2) F(x) = ∫_{−∞}^{∞} F₁(x − z) dF₂(z) = ∫_{−∞}^{∞} F₂(x − z) dF₁(z),

where F₁ and F₂ are the distribution functions of the terms.¹)

When three d. f:s satisfy (15.12.2), we shall say that F is composed of the components F₁ and F₂, and we shall use the abbreviation

(15.12.2 a) F(x) = F₁(x) ∗ F₂(x) = F₂(x) ∗ F₁(x).

By (15.12.1), this symbolical multiplication of the d. f:s corresponds to a genuine multiplication of the c. f:s.
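For two discrete distributions, the composition (15.12.2) reduces to a finite sum over the mass points. A minimal sketch (two ordinary six-sided dice, chosen purely as an illustration) performs this discrete composition and recovers the familiar distribution of the sum of two dice.

```python
from fractions import Fraction

# Sketch of the composition (15.12.2) for two discrete distributions:
# the mass of the composed distribution at a point s is
#   sum over x of  P1(x) * P2(s - x).

die = {k: Fraction(1, 6) for k in range(1, 7)}   # illustrative component

def compose(P1, P2):
    """Discrete analogue of F = F1 * F2 (composition of d. f:s)."""
    out = {}
    for x, px in P1.items():
        for y, py in P2.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

two_dice = compose(die, die)
print(two_dice[7])        # mass of the composed distribution at 7
```

Since the operation is commutative and associative, repeated calls to `compose` give the distribution (15.12.3) of a sum of any number of independent discrete variables.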
If the three variables ξ₁, ξ₂ and ξ₃ are independent, an evident modification of the proof of (15.12.2) shows that the sum ξ₁ + ξ₂ + ξ₃ has the d. f. (F₁ ∗ F₂) ∗ F₃ = F₁ ∗ (F₂ ∗ F₃). Obviously this may be generalized to any number of components, and it is seen that the operation of composition is commutative and associative. For the sum ξ₁ + ⋯ + ξ_n of n independent variables we have the d. f.

(15.12.3) F = F₁ ∗ F₂ ∗ ⋯ ∗ F_n.

Let us now consider the following two particular cases of the composition of two components according to (15.12.2):

a) Both components belong to the discrete type (cf 15.2).

b) Both components belong to the continuous type, and at least one of the fr. f:s, say f₁, is bounded for all x.

In case a), let x₁, x₂, … and y₁, y₂, … denote the discontinuity points of F₁ and F₂ respectively. It is then evident that the total mass of the composed distribution is concentrated in the points x_r + y_s, where r and s independently assume the values 1, 2, …. If the set of all these points has no finite limiting point, the composed d. f. thus also belongs to the discrete type. This is the case e. g. when all the x_r and y_s are non-negative, or when at least one of the sequences {x_r} and {y_s} is finite.

In case b), the first integral in (15.12.2) satisfies the conditions for derivation with respect to x (cf 7.3.2). Further, by (7.3.1) and (7.5.5), the derivative F′(x) = f(x) is continuous for all x, and may be expressed as a Riemann integral

(15.12.4) f(x) = ∫_{−∞}^{∞} f₁(x − z) f₂(z) dz = ∫_{−∞}^{∞} f₂(x − z) f₁(z) dz.

Thus the composed distribution belongs to the continuous type, and the fr. f. f(x) is everywhere continuous.

¹) The reader should try to construct a direct proof of this theorem, without the use of characteristic functions. It is to be proved that, in the two-dimensional distribution of the independent variables ξ and η, the mass quantity F(x) situated in the half-plane ξ + η ≤ x is given by (15.12.2). Cf. Cramér, Ref. 11, p. 35.
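In case b), the integral (15.12.4) can be approximated by a Riemann sum. The sketch below uses two rectangular fr. f:s on (0, 1), an illustrative choice; their composition is the triangular density on (0, 2), with f(1) = 1.

```python
# Sketch: numerical evaluation of the convolution (15.12.4)
#   f(x) = integral of f1(x - z) * f2(z) dz
# for two rectangular fr. f:s on (0, 1); the result is triangular on (0, 2).

def f1(x):
    return 1.0 if 0.0 <= x < 1.0 else 0.0   # bounded, as case b) requires

f2 = f1

def convolve_at(x, h=1e-3):
    # Midpoint Riemann sum over the support (0, 1) of f2.
    n = int(1.0 / h)
    return sum(f1(x - (i + 0.5) * h) * f2((i + 0.5) * h) for i in range(n)) * h

print(convolve_at(0.5), convolve_at(1.0), convolve_at(1.5))
```

The step size h is an arbitrary choice; refining it brings the sum as close to the integral as desired, in accordance with the Riemann-integral representation above.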
Returning to the general case, we denote by m₁, m₂ and m the means, and by σ₁, σ₂ and σ the s. d:s of ξ, η and ξ + η respectively. Since ξ and η are independent, we then have by (15.3.7) and (15.6.1)

(15.12.5) m = m₁ + m₂,  σ² = σ₁² + σ₂².

For the higher moments about the mean, a general expression is deduced from the relation

μ_ν = E[(ξ + η − m)^ν] = E[(ξ − m₁ + η − m₂)^ν].

Since any first order moment about a mean is zero, we have in particular, using easily understood notations,

(15.12.6)
μ₃ = μ₃⁽¹⁾ + μ₃⁽²⁾,
μ₄ = μ₄⁽¹⁾ + μ₄⁽²⁾ + 6 μ₂⁽¹⁾ μ₂⁽²⁾.

The composition formulae for moments are directly extended to the case of more than two variables. For the addition of n independent variables, we thus have the following simple expressions for the moments of the three lowest orders:

(15.12.7)
m = m₁ + m₂ + ⋯ + m_n,
σ² = σ₁² + σ₂² + ⋯ + σ_n²,
μ₃ = μ₃⁽¹⁾ + μ₃⁽²⁾ + ⋯ + μ₃⁽ⁿ⁾.

For the higher moments (ν > 3), the formulae become more complicated.

Finally, we shall consider the semi-invariants of the composed distribution. The multiplication theorem for characteristic functions gives us log φ(t) = log φ₁(t) + log φ₂(t). Hence we obtain by (15.10.2) κ_ν = κ_ν⁽¹⁾ + κ_ν⁽²⁾. This simple composition rule is the chief reason for introducing the semi-invariants. The extension to the case of n independent variables is immediate and gives

(15.12.8) κ_ν = κ_ν⁽¹⁾ + κ_ν⁽²⁾ + ⋯ + κ_ν⁽ⁿ⁾.

CHAPTER 16.

Various Discrete Distributions.

16.1. The function ε(x). — The simplest discrete distribution has the total mass 1 concentrated in one single point, say in the point x = 0. This is the distribution of a variable ξ which is »almost always» equal to zero, i. e. such that P(ξ = 0) = 1. The corresponding d. f. is the function ε(x) defined by (6.7.1):

(16.1.1) ε(x) = 0 for x < 0,  ε(x) = 1 for x ≥ 0.

The c. f. is identically equal to 1, as we have already remarked in 10.1. More generally, a »variable» which is almost always equal to x₀ has the d. f. ε(x − x₀) and the c. f. e^{itx₀}. The mean of this variable is x₀, and the s. d. is zero. Conversely, if it is known that the s. d. of a certain variable is equal to zero, it follows (cf 15.6) that the whole mass of the distribution is concentrated in one single point, so that the d. f. must be of the form ε(x − x₀). The general d. f. of the discrete type as given by (15.2.1) may be written

(16.1.2) F(x) = Σ_ν p_ν ε(x − x_ν).

Let us consider the particular case of a discrete variable ξ, the distribution of which is specified in the following way:

(16.1.3) ξ = 1 with the probability p, ξ = 0 with the probability q = 1 − p.
of a certain variable is equal to zero, it follows (cf 15.6) that the whole mass of the distribution is concentrated in one single point, so that the d. f . must be of the form e(x — Xq). The general d. f. of the discrete type as given by (15.2.1) may be written (16.1.2) F(x) = *(*“*»)• Let 11 B consider the particular case of a discrete variable the distri- bution of which is specified in the following way: (16.1.3) { 1 with the probability p, 0 » » » g = 1 —p. 192 16.1-2 In the following paragraph, we shall make an important use of variables possessing this distribution. From (16.1.2) we obtain the d. f. of i‘ F{x) = p — \ ) -\- q € (.r). and hence the c.f. (16.1.4) (p[t) ^ q ^ \ +jD(e'^— 1). The mean and variance of £ are (Ib.l.f)) E (i‘) -- p' \ +7*0= p, DM?) -P? 4- q{i)-p? = pq. 16.2. The binomial distribution. — Let be a given random ex- periment, and denote by E an event having a definite probability p to occur at each performance of (5. Consider a series of independent repetitions of (£’ (cf 14.4), and let us define a random variable iV attached to the r:th ex[)eriinent by writing I 1 when Ej occurs at the r:th experiment (probability — /;), I 0 otherwise (probability ~ ~ \ -—p). Then each ir has the probability distribution (16.1.3) considered in the ])receding paragraph, and the variables f,, . . .. £„ are independent. Obviously ?r denotes the number of occurrences of E in the r:th ex- ])eriment, so that the sum r — 4* -f • 4- denotes the total number of occurrences of the eient E in our series of n repetitions of the experiment (£. Since v is a sum of n independent random variables, it is itself a random variable^), the distribution of which may be found by the methods developed in 15.12. Thus we obtain by (15.12.7) and (16.1.5) the following expressions for the mean, the variance and the s. d. 
of ν:

(16.2.1) E(ν) = np,  D²(ν) = npq,  D(ν) = √(npq).

¹) Throughout the general theory developed in the preceding chapters, we have systematically used the letters ξ and η to denote random variables. From now on it would, however, be inconvenient to adhere strictly to this rule. We shall thus often find it practical to allow any other letters (Greek or italic) to denote random variables. It will thus always be necessary to observe with great care the significance of the various letters used in the formulae.

The ratio ν/n expresses the frequency of E in our series of n repetitions. For the mean and the s. d. of ν/n, we have

(16.2.2) E(ν/n) = p,  D(ν/n) = √(pq/n).

The c. f. of ν is by (15.12.1) equal to the product of the c. f:s of all the ξ_r, and thus we obtain from (16.1.4)

(16.2.3) φ(t) = (p e^{it} + q)^n = (1 + p(e^{it} − 1))^n.

Developing the first expression by the binomial theorem, we find

φ(t) = Σ_{r=0}^{n} \binom{n}{r} p^r q^{n−r} e^{irt}.

By (10.1.5) this is, however, the c. f. of a variable which may assume the values r = 0, 1, …, n with the probabilities p_r = \binom{n}{r} p^r q^{n−r}. Owing to the one-to-one correspondence between distributions and characteristic functions, we may thus conclude (cf 15.9) that the probability distribution of ν is specified by the relation

(16.2.4) P(ν = r) = p_r = \binom{n}{r} p^r q^{n−r}  (r = 0, 1, …, n).

This is the binomial distribution, the simplest properties of which we assume to be already known. It is a distribution of the discrete type, involving two parameters n and p, where n is a positive integer, while 0 < p < 1. (The cases p = 0 and p = 1 are trivial and will be excluded from our discussion.) The corresponding d. f.

(16.2.5) B_n(x; p) = P(ν ≤ x) = Σ_{r ≤ x} \binom{n}{r} p^r q^{n−r}

is a step-function, with steps of the height p_r in the n + 1 discrete mass points r = 0, 1, …, n. In order to find the moments μ_ν about the mean of the binomial distribution, we consider the c. f. of the deviation ν − np.
This is

E(e^{it(ν−np)}) = e^{−inpt}(p e^{it} + q)^n = (p e^{iqt} + q e^{−ipt})^n.

Thus all moments μ_ν are finite and may be found by equating coefficients in the relation

Σ_ν μ_ν (it)^ν/ν! = (p e^{iqt} + q e^{−ipt})^n.

In particular, we find

(16.2.6)
μ₂ = σ² = npq,
μ₃ = npq(q − p),
μ₄ = 3n²p²q² + npq(1 − 6pq).

For the coefficients of skewness and excess, we thus have the expressions

γ₁ = μ₃/σ³ = (q − p)/√(npq),  γ₂ = μ₄/σ⁴ − 3 = (1 − 6pq)/(npq).

The skewness is positive for p < ½, negative for p > ½, and zero for p = ½. Both coefficients γ₁ and γ₂ tend to zero as n → ∞.

Let ν₁ and ν₂ denote two independent variables, both having binomial distributions with the same value of the parameter p, and with the values n₁ and n₂ of the parameter n. We may, e. g., take ν₁ and ν₂ equal to the number of occurrences of the event E in two independent series of n₁ and n₂ repetitions of the experiment 𝔈. The sum ν₁ + ν₂ is then equal to the number of occurrences of E in a series of n₁ + n₂ repetitions. Accordingly the c. f. of ν₁ + ν₂ is (cf 15.12)

(p e^{it} + q)^{n₁} (p e^{it} + q)^{n₂} = (p e^{it} + q)^{n₁+n₂}.

This is the c. f. of a binomial distribution with the parameters p and n₁ + n₂. Thus the addition of two independent variables with the d. f:s B_{n₁}(x; p) and B_{n₂}(x; p) gives (as may, of course, also be directly perceived) a variable with the d. f. B_{n₁+n₂}(x; p). In the abbreviated notation of (15.12.2 a) this may be written

B_{n₁}(x; p) ∗ B_{n₂}(x; p) = B_{n₁+n₂}(x; p).

Thus the binomial distribution reproduces itself by addition of independent variables. We shall call this an addition theorem for the binomial distribution. Later, we shall see that similar (but less evident) addition theorems hold also for certain other important distributions.

16.3. Bernoulli's theorem. — For the frequency ratio ν/n considered in the preceding paragraph, we have by (16.2.2)

E(ν/n) = p,  D(ν/n) = √(pq/n).

We now apply the Bienaymé–Tchebycheff inequality (15.7.2), taking k = ε√(n/(pq)), where ε denotes a given positive quantity.
Denoting by P the probability function of the variable ν, we then obtain the following result:

(16.3.1) P(|ν/n − p| ≥ ε) ≤ pq/(nε²) ≤ 1/(4nε²).

If δ denotes another given positive quantity, it follows that, as soon as we take n > 1/(4δε²), the probability on the left hand side of (16.3.1) becomes smaller than δ. Since δ is arbitrarily small, we have proved the following theorem.

The probability that the frequency ν/n differs from its mean value p by a quantity of modulus at least equal to ε tends to zero as n → ∞, however small ε > 0 is chosen.

This is, in modern terminology, the classical Bernoulli theorem, originally proved by James Bernoulli, in his posthumous work Ars Conjectandi (1713), in a quite different way. Bernoulli considered the two complementary probabilities

ϖ = P(|ν/n − p| ≥ ε),  ϖ′ = 1 − ϖ = P(|ν/n − p| < ε),

and proved by a direct evaluation of the terms of the binomial expansion that, for any given ε > 0, the ratio ϖ′/ϖ may be made to exceed any given quantity by choosing n sufficiently large.

The variable ν is, according to the preceding paragraph, attached to a combined experiment, consisting in a series of n repetitions of the original experiment 𝔈. Thus by 13.6 any probability statement with respect to ν is a statement concerning the approximate value of the frequency of some specified event in a series of repetitions of the combined experiment. The frequency interpretation (cf 13.6) of any such probability statement thus always refers to a series of repetitions of the combined experiment. Consider e. g. the frequency interpretation of the probability ϖ defined above. We begin by making a series of n repetitions of the experiment 𝔈, and noting the number ν of occurrences of the event E. This is our first performance of the combined experiment. If the observed number ν satisfies the relation |ν/n − p| ≥ ε, we say that the event E′ occurs in the first combined experiment. The event E′ has then the probability ϖ.
We then repeat the whole series of n experiments a large number n′ of times, so that we finally obtain a series of n′ repetitions of the combined experiment. The total number of performances of 𝔈 required will then, of course, be n′n. Let ν′ denote the number of occurrences of E′ in the whole series of n′ repetitions of the combined experiment. The frequency interpretation of the probability ϖ then consists in the following statement: For large values of n′, it is practically certain that the frequency ν′/n′ will be approximately equal to ϖ.

Now the Bernoulli theorem as expressed by (16.3.1) shows that, as soon as we take n > 1/(4δε²), we have ϖ < δ, where δ is given and arbitrarily small. In a long series of repetitions of the combined experiment (i. e. for large n′), we should then expect the event |ν/n − p| ≥ ε to occur with a frequency smaller than δ. Choosing for δ some very small number, and making one single performance of the combined experiment, i. e. one single series of n repetitions of the experiment 𝔈, we may then (cf 13.6) consider it as practically certain that the event |ν/n − p| ≥ ε will not occur.

What value of δ we should choose in order to realize a satisfactory degree of »practical certainty» depends on the risk that we are willing to run with respect to a failure of our predictions. Suppose, however, that we have agreed to consider a certain value δ as sufficiently small for our purpose. Returning to the original event E with the probability p, we may then give the following more precise statement of the frequency interpretation of this probability, as given in 13.5:

Let ε > 0 be given. If we choose n > 1/(4δε²), it is practically certain that, in one single series of n repetitions of the experiment 𝔈, we shall have |ν/n − p| < ε.

This statement may be called the frequency interpretation of the Bernoulli theorem.
Like all frequency interpretations, this is not a mathematical theorem, but a statement concerning certain observable facts, which must hold true if the mathematical theory is to be of any practical value.

16.4. De Moivre's theorem. — The random variable

(16.4.1) ν = ξ₁ + ξ₂ + ⋯ + ξ_n

considered in the two preceding paragraphs has, by (16.2.1), the mean np and the standard deviation √(npq). The standardized variable (cf 15.6)

(16.4.2) λ = (ν − np)/√(npq)

thus has the mean 0 and the s. d. 1. The transformation by which we pass from ν to λ consists, of course, only in a change of origin and scale of the variable. The ordinates in the diagram of the probability distribution have the same values for both variables. We have, in fact, using the same notations as in the preceding paragraphs,

P(ν = r) = P(λ = (r − np)/√(npq)) = p_r

for r = 0, 1, …, n. The d. f. and the c. f. of the variable ν are given by (16.2.5) and (16.2.3). Denoting by F_n(x) and φ_n(t) the corresponding functions of the standardized variable λ, we obtain (cf 15.9)

(16.4.3)
F_n(x) = B_n(np + x√(npq); p),
φ_n(t) = e^{−inpt/√(npq)} (p e^{it/√(npq)} + q)^n = (p e^{iqt/√(npq)} + q e^{−ipt/√(npq)})^n.

We shall now consider the behaviour of the probability distribution of λ for increasing values of n, when p has a fixed value. We begin by making a transformation of the above expression for the c. f. φ_n(t). For any integer k > 0 and for any real z we have the MacLaurin expansion

(16.4.4) e^{iz} = Σ_{ν=0}^{k−1} (iz)^ν/ν! + ϑ |z|^k/k!,

where we use ϑ as a general symbol for a real or complex quantity of modulus not exceeding unity. Using this development with k = 3, we obtain

e^{iqt/√(npq)} = 1 + iqt/√(npq) − q²t²/(2npq) + ϑ q³|t|³/(6(npq)^{3/2}),
e^{−ipt/√(npq)} = 1 − ipt/√(npq) − p²t²/(2npq) + ϑ p³|t|³/(6(npq)^{3/2}),

and hence, introducing in (16.4.3),

φ_n(t) = (1 − t²/(2n) + ϑ(p² + q²)|t|³/(6n√(npq)))^n.

Writing

y = −t²/2 + ϑ(p² + q²)|t|³/(6√(npq)),

this gives us

log φ_n(t) = n log(1 + y/n) = y · [log(1 + y/n)]/(y/n).

Now as n tends to infinity while t remains fixed, it is obvious that y tends to −t²/2. Hence y/n tends to zero, and [log(1 + y/n)]/(y/n) tends to unity. It then follows that log φ_n(t) tends to −t²/2, and finally that

φ_n(t) → e^{−t²/2}

for every t.
We are now in a position to apply the continuity theorem 10.4 for c. f:s. We have just proved that the sequence {φ_n(t)} of c. f:s defined by (16.4.3) converges, for every t, to the limit e^{−t²/2}, which is continuous for all t. By the continuity theorem we then infer 1) that the limit e^{−t²/2} is itself the c. f. of a certain d. f., and 2) that the sequence of d. f:s {F_n(x)} defined by (16.4.3) converges to the d. f. which corresponds to the c. f. e^{−t²/2}. Now we have by (10.5.3) and (10.5.4)

e^{−t²/2} = ∫_{−∞}^{∞} e^{itx} dΦ(x),

where

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt,

so that e^{−t²/2} is the c. f. of the d. f. Φ(x) given by the last expression. This is the important normal distribution function that will be separately treated in the following chapter. For our present purpose we only observe that Φ(x) is continuous for every x. We have thus proved the following limit theorem for the binomial distribution, first obtained by De Moivre in 1733:

For every fixed x and p, we have

(16.4.5) lim_{n→∞} B_n(np + x√(npq); p) = Φ(x).

Thus the binomial distribution of the variable ν = ξ₁ + ⋯ + ξ_n, appropriately standardized by the mean and the s. d. according to (16.4.2), tends to the normal distribution as n tends to infinity.

We shall see later (cf 17.4) that this is only a particular case of a very general and important theorem concerning the distribution of the sum of a large number of independent random variables. — The method of proof used above has been chosen with a view to prepare the reader for the proof of this general theorem. In the present particular case of the binomial distribution it is, however, possible to reach the same result also by a more direct method, without the use of characteristic functions. This is the method usually found in text-books, and we shall here content ourselves with some brief indications on the subject, referring for further detail to some standard treatise on probability theory.
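The limit theorem (16.4.5) lends itself to a direct numerical check. The sketch below (the parameter values n, p and x are illustrative choices) computes the exact binomial d. f. B_n(np + x√(npq); p) and compares it with Φ(x), the latter being evaluated through the error function.

```python
import math

# Sketch: numerical illustration of De Moivre's theorem (16.4.5),
#   B_n(np + x*sqrt(npq); p)  ->  Phi(x)   as n -> infinity.

def Phi(x):
    # Normal d. f. expressed through the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def B(n, x_limit, p):
    # Exact binomial d. f.: sum of C(n, r) p^r q^(n-r) for r <= x_limit.
    q = 1.0 - p
    return sum(math.comb(n, r) * p**r * q**(n - r)
               for r in range(0, min(n, math.floor(x_limit)) + 1))

n, p, x = 400, 0.3, 1.0            # illustrative values
standardized = B(n, n * p + x * math.sqrt(n * p * (1 - p)), p)
print(standardized, Phi(x))
```

For moderate n the two values already differ by less than a few hundredths; increasing n shrinks the discrepancy further, in agreement with the theorem.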
The relation (16.4.5) is equivalent to

(16.4.6) Σ p_ν → Φ(λ₂) − Φ(λ₁) = (1/√(2π)) ∫_{λ₁}^{λ₂} e^{−t²/2} dt,

the sum being extended over all ν with np + λ₁√(npq) < ν ≤ np + λ₂√(npq), for any fixed interval (λ₁, λ₂). Now (16.4.6) may be proved by means of a direct evaluation of the terms in the binomial expansion. For this purpose, we express the factorials in the binomial coefficient appearing in (16.4.6) by means of the Stirling formula (12.5.3). We then obtain after some calculations the expression

(16.4.7) p_ν = \binom{n}{ν} p^ν q^{n−ν} = (1/√(2πnpq)) e^{−(ν−np)²/(2npq)} (1 + ϑ C/√n),

where C is a quantity depending on p, but not on ν or n, while ϑ has the same significance as before. The first member of (16.4.6) is thus equal to

(1/√(2πnpq)) Σ e^{−(ν−np)²/(2npq)} + ϑ C′/√n,

the sum being extended over the same values of ν as in (16.4.6). As n → ∞, the second term in this expression tends to zero, while the first term is a Darboux sum approximating the integral in the second member of (16.4.6) and tending to this integral as its limit. Thus (16.4.6) is proved.

For the graphical illustration of the limit theorem (16.4.5), we may in the first place have recourse to a direct comparison between the graphs of the distribution functions B_n and Φ, as shown in some cases by Figs. 8–9.

Fig. 8. Distribution function of ν (or λ) and normal distribution function; p = 0.3, n = 5.

Fig. 9. Distribution function of ν (or λ) and normal distribution function; p = 0.3, n = 30.

Fig. 11. √(npq) · p_ν and normal frequency function; p = 0.3, n = 30.

We may, however, also use the relation (16.4.7). If we allow here ν to tend to infinity with n, in such a way that λ = (ν − np)/√(npq) tends to a finite limit, we obtain
√(npq) · p_ν → (1/√(2π)) e^{−λ²/2}.

If the scale of ν is transformed by choosing the mean np as origin and the s. d. √(npq) as unit, and if at the same time every probability p_ν is multiplied by √(npq), the upper end-points of the corresponding ordinates will thus approach the frequency curve y = (1/√(2π)) e^{−x²/2} of the normal distribution, as n → ∞. This is illustrated by Figs. 10–11.

16.5. The Poisson distribution. — In the preceding paragraph, we have seen that the discrete binomial distribution may, by a limit passage, be transformed into a new distribution of the continuous type, viz. the normal distribution. By an appropriate modification of the limit passage, we may also obtain a limiting distribution of the discrete type.

Suppose that, in the binomial distribution, we allow the probability p to depend on n in such a way that p tends to zero when n tends to infinity. More precisely, we shall suppose that

(16.5.1) p = λ/n,

where λ is a positive constant. For the probability p_r given by (16.2.4) we then obtain, as n → ∞,

p_r = \binom{n}{r} (λ/n)^r (1 − λ/n)^{n−r} → (λ^r/r!) e^{−λ}

for every fixed r = 0, 1, 2, …. The sum of all the limiting values is unity, since we have

Σ_{r=0}^{∞} (λ^r/r!) e^{−λ} = e^{λ} · e^{−λ} = 1.

If the probability distribution of a random variable ξ is specified by

(16.5.2) P(ξ = r) = (λ^r/r!) e^{−λ} for r = 0, 1, 2, …,

ξ is said to possess a Poisson distribution. This is a discrete distribution with one parameter λ, which is always positive. All points r = 0, 1, 2, … are discrete mass points. Two cases of the distribution are illustrated by Figs. 12–13.

Fig. 12. Poisson distribution, λ = 0.8.

Fig. 13. Poisson distribution, λ = 3.5.

The c. f. of the Poisson distribution is

(16.5.3) φ(t) = Σ_{r=0}^{∞} (λ^r/r!) e^{−λ} e^{irt} = e^{λ(e^{it} − 1)}.

According to (15.10.2), this shows that the semi-invariants of the distribution are all finite and equal to λ. From the two first semi-invariants, we find the mean and the s. d. of the Poisson distribution:

E(ξ) = λ,  D(ξ) = √λ.

Writing p = λ/n in the second expression (16.2.3) of the c. f. of the binomial distribution, and allowing n to tend to infinity, it is readily seen that this function tends to the c. f.
(16.5.3) of the Poisson distribution. By the continuity theorem 10.4, it then follows that the binomial distribution tends to the Poisson distribution, which confirms the result already obtained by direct study of the probability p_r. It is also easily shown that the condition (16.5.1) can be replaced by the more general condition np → λ, without modifying the result.

Finally, if ξ₁ and ξ₂ are independent Poisson-distributed variables, with the parameters λ₁ and λ₂, the sum ξ₁ + ξ₂ has the c. f.

e^{λ₁(e^{it} − 1)} · e^{λ₂(e^{it} − 1)} = e^{(λ₁+λ₂)(e^{it} − 1)}.

This is the c. f. of a Poisson distribution with the parameter λ₁ + λ₂. Thus the sum has a Poisson distribution with the parameter λ₁ + λ₂, and we see that the Poisson distribution, like the binomial, has the property of reproducing itself by addition of independent variables. Denoting by F(x; λ) the d. f. of the Poisson distribution, the addition theorem for this distribution is expressed by the relation

(16.5.4) F(x; λ₁) ∗ F(x; λ₂) = F(x; λ₁ + λ₂).

In statistical applications, the Poisson distribution often appears when we are concerned with the number of occurrences of a certain event in a very large number of observations, the probability for the event to occur in each observation being very small. Examples are the annual number of suicides in a human population, the number of yeast cells in a small sample from a large quantity of suspension, etc. Cf e. g. Bortkiewicz, Ref. 63 a.

In an important group of applications, the fundamental random experiment consists in observing the number of occurrences of a certain event during a time interval of duration t, where the choice of t is at our liberty. This situation occurs e. g. in problems of telephone traffic, where we are concerned with the number of telephone calls during time intervals of various durations. — Suppose that, in such a case, the numbers of occurrences during non-overlapping time intervals are always independent. Suppose further that the probability that exactly one event occurs in an interval of duration Δt is, for small Δt, equal to λΔt + o(Δt), where λ is a constant, while the corresponding probability for the occurrence of more than one event is o(Δt). — Dividing a time interval of duration t in n equal parts, we may consider the n parts as representing n repetitions of a random experiment, where the probability for the event to occur in each instance is λt/n + o(1/n).
Suppose further that the probability that exactly one event occurs in an interval of duration Jt is, for small At, equal to XAt-\ o{At, where A is a constant, while the corresponding probability for the occurrence of more than one event is o{Af). — Dividing a time interval of duration t m n equal parts, we may consider the n parts ns representing n repetitions of a random experiment, where the probability for the event to occur in each instance is 205 16.5-6 Allowing n to tend to infinity, we find that the total number of events occurring during the time t will be distributed in a Poisson distribution with the parameter Xt. — Variables of this type are, besides the number of telephone calls already mentioned, the number of disintegrated radioactive atoms, the number of claims in an instirance company, etc. 16 . 6 . The generalized binomial distribution of Poisson. — Suppose that are n random experiments, such that the random vari- ables attached to the experiments are independent. With each experi- ment ®r, we associate an event Er having the probability pr ~ ^ — gr to occur in a performance of ®r. Let us make one performance of each experiment .... ®„, and note in each case whether the associated event occurs or not. We shall call this a series of independent trials. If, in the experiment (ir, the associated event Er occurs, we shall say that the 7*:th trial is a success] in the opposite case we have a failure. Let v be the total number of successes in all 7i trials. What is the probability distribu- tion of In the particular case when all the experiments ®r and all the events Er are identical, v reduces to the variable considered in 16.2, and the required distribution is the binomial distribution. The general case was considered by Poisson (Ref. 32). In the same way as in 16.2, we define a variable attached to the r:th trial, and taking the value 1 for a success (probability pr), and 0 for a failure (probability <7r = 1 — Pr). The variables . . 
, ξ_n are independent, and each has a distribution of the form (16.1.3). As in the previous case, the total number of successes is

ν = ξ₁ + ξ₂ + ⋯ + ξ_n.

The c. f. of the random variable ν is the product of the c. f:s of all the ξ_r:

E(e^{itν}) = Π_{r=1}^{n} (p_r e^{it} + q_r).

The possible values for ν are r = 0, 1, …, n, and the probability that ν takes any particular value r is equal to the coefficient of e^{irt} in the development of the product. For the mean value and the variance of ν we have the expressions

E(ν) = Σ_{r=1}^{n} p_r,  D²(ν) = Σ_{r=1}^{n} p_r q_r.

Denoting by P the probability function of ν, and writing p̄ for the arithmetic mean (1/n) Σ p_r, an application of the Bienaymé–Tchebycheff inequality (15.7.2) now gives the result analogous to (16.3.1)

(16.6.2) P(|ν/n − p̄| ≥ ε) ≤ (Σ p_r q_r)/(n²ε²) ≤ 1/(4nε²).

We thus have the following generalization of Bernoulli's theorem, found by Poisson:

The probability that the frequency of successes ν/n differs from the arithmetic mean p̄ of the probabilities p_r by a quantity of modulus at least equal to ε tends to zero as n → ∞, however small ε > 0 is chosen.

The frequency interpretation of the generalized theorem is quite similar to the one given in 16.3 for the Bernoulli theorem. Consider in particular the case when all the probabilities p_r are equal to p. We then see that in a long series of independent trials, where the probability of a success is constantly equal to p, though all trials may be different experiments, it is practically certain that the frequency of successes will be approximately equal to p.

There is also a generalization of De Moivre's theorem (16.4.6) to the present case. This will, however, not be proved here, but will be deduced later as a particular case of a still more general theorem to be proved in 17.4.
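The distribution of ν may be obtained explicitly by multiplying out the product of the factors (q_r + p_r z), the coefficient of z^r giving P(ν = r). The following sketch does this and confirms the expressions E(ν) = Σp_r and D²(ν) = Σp_r q_r; the probabilities p_r used are illustrative values only.

```python
# Sketch: the generalized binomial distribution of Poisson.  The
# distribution of nu = xi_1 + ... + xi_n is found by multiplying out the
# product of the factors (q_r + p_r * z); the coefficient of z^r is
# P(nu = r).

probs = [0.1, 0.3, 0.5, 0.7, 0.9]         # illustrative p_r values

dist = [1.0]                              # distribution of an empty sum
for p in probs:
    q = 1.0 - p
    new = [0.0] * (len(dist) + 1)
    for r, mass in enumerate(dist):       # convolve with (q, p)
        new[r] += q * mass
        new[r + 1] += p * mass
    dist = new

mean = sum(r * m for r, m in enumerate(dist))
variance = sum((r - mean) ** 2 * m for r, m in enumerate(dist))

print(mean, sum(probs))                           # E(nu) = sum of p_r
print(variance, sum(p * (1 - p) for p in probs))  # D^2(nu) = sum of p_r q_r
```

With these values, Σp_r q_r = 0.85 while the »Bernoulli variance» np̄q̄ = 5 · 0.5 · 0.5 = 1.25, in accordance with the variance comparison made in the text.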
For the variance of ν, we have found the value D²(ν) = Σ p_r q_r. For a series of n trials with the constant probability p̄, the corresponding variance is np̄q̄, where q̄ = 1 − p̄. In order to compare the two variances we write

Σ_{r=1}^{n} p_r q_r = Σ (p̄ + p_r − p̄)(q̄ + p̄ − p_r) = n p̄ q̄ − Σ (p_r − p̄)².

Thus the »Poisson variance» Σ p_r q_r is always smaller than the corresponding »Bernoulli variance» np̄q̄. At first sight, this result may seem a little surprising. It becomes more natural if we consider the extreme case when all the probabilities p_r are equal to 0 or 1, both values being represented. The Poisson variance is then equal to zero, while the Bernoulli variance is necessarily positive.

CHAPTER 17.

The Normal Distribution.

17.1. The normal functions. — The normal distribution function, which has already appeared in 10.5 and 16.4, is defined by the relation

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The corresponding normal frequency function is

Φ′(x) = (1/√(2π)) e^{−x²/2}.

Diagrams of these functions are given in Figs. 14–15, and some numerical values are found in Table 1, p. 557. The mean value of the distribution is 0, and the s. d. is 1, as shown by (10.5.1):

(17.1.1) ∫_{−∞}^{∞} x dΦ(x) = 0,  ∫_{−∞}^{∞} x² dΦ(x) = 1.

Generally, all moments of odd order vanish, while the moments of even order are, according to (10.5.1),

(17.1.2) μ_{2ν} = (1/√(2π)) ∫_{−∞}^{∞} x^{2ν} e^{−x²/2} dx = 1 · 3 · 5 ⋯ (2ν − 1).

Finally, the c. f. is by (10.5.4)

(17.1.3) ∫_{−∞}^{∞} e^{itx} dΦ(x) = e^{−t²/2}.

17.2. The normal distribution. — A random variable ξ will be said to be normally distributed with the parameters m and σ, or briefly normal (m, σ), if the d. f. of ξ is Φ((x − m)/σ), where σ > 0 and m are constants. The fr. f. is then

(1/(σ√(2π))) e^{−(x−m)²/(2σ²)},

and we obtain from (17.1.1)

(1/(σ√(2π))) ∫_{−∞}^{∞} x e^{−(x−m)²/(2σ²)} dx = (1/√(2π)) ∫_{−∞}^{∞} (m + σx) e^{−x²/2} dx = m,

(1/(σ√(2π))) ∫_{−∞}^{∞} (x − m)² e^{−(x−m)²/(2σ²)} dx = σ²,

so that m and σ denote as usual the mean and the s. d. of the variable.
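The d. f. Φ((x − m)/σ) of a normal (m, σ) variable is conveniently evaluated through the error function. A short sketch (with illustrative parameter values) checks that m is the median, and that the mass within one s. d. of the mean is about 0.683, consistent with the tabulated values of Φ.

```python
import math

# Sketch: the normal (m, sigma) distribution function expressed through
# the error function:  F(x) = Phi((x - m)/sigma).

def normal_df(x, m=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))

m, sigma = 2.0, 0.5                      # illustrative parameters

# m is simultaneously mean, median and mode: F(m) = 1/2.
print(normal_df(m, m, sigma))

# Mass within one s. d. of the mean: Phi(1) - Phi(-1), about 0.6827.
print(normal_df(m + sigma, m, sigma) - normal_df(m - sigma, m, sigma))
```

Note that the parameters m and σ enter only through the standardized argument (x − m)/σ, which is the change of origin and scale described in the text.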
The frequency curve

y = (1/(σ√(2π))) e^{−(x−m)²/(2σ²)}

is symmetric and unimodal (cf 15.5), and reaches its maximum at the point x = m, so that m is simultaneously mean, median and mode of the distribution. For x = m ± σ, the curve has two inflexion points. A change in the value of m causes only a displacement of the curve, without modifying its form, whereas a change in the value of σ amounts to a change of scale on both coordinate axes. The total area included between the curve and the x-axis is, of course, always equal to 1. Curves corresponding to some different values of σ are shown in Fig. 16.

Fig. 16. Normal frequency curves, m = 0, σ = 0.4, 1.0, 2.5.

The smaller we take σ, the more we concentrate the mass of the distribution in the neighbourhood of x = m. In the limiting case σ = 0, the whole mass is concentrated in the point x = m, and consequently (cf 16.1) the d. f. is equal to ε(x − m). This case will be regarded as a degenerate limiting case and called a singular normal distribution; the d. f. Φ((x − m)/σ) with σ = 0 will always be interpreted as ε(x − m).

It is often important to find the probability that a normally distributed variable differs from its mean m in either direction by more than a given multiple λσ of the s. d. This probability is equal to the joint area of the two »tails» of the frequency curve that are cut off by ordinates through the points x = m ± λσ. Owing to the symmetry of the distribution, this is

P = P(|ξ − m| > λσ) = 2(1 − Φ(λ)) = √(2/π) ∫_{λ}^{∞} e^{−t²/2} dt.

Conversely, we may regard λ as a function of P, defined by this equation. Then λ expresses, in units of the s. d. σ, that deviation from the mean value m which is exceeded with the given probability P. When P is expressed as a percentage, say P = p/100, the corresponding λ = λ_p is called the p percent value of the normal deviate (ξ − m)/σ. Some numerical values of p as a function of λ_p, and of λ_p as a function of p, are given in Table 2, p. 558. From the value of λ_p for p = 50, it follows that the quartiles (cf 15.6) of the normal distribution are m ± 0.6745σ.
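The p percent value λ_p defined above is the root of 2(1 − Φ(λ)) = p/100 and can be computed by simple bisection on the tail probability, as in the following sketch (the bracketing interval and tolerance are arbitrary choices).

```python
import math

# Sketch: compute the p percent value lambda_p of the normal deviate,
# i.e. the root of  2*(1 - Phi(lam)) = p/100,  by bisection.

def two_tail(lam):
    # P(|xi - m| > lam*sigma) = 2*(1 - Phi(lam)) = 1 - erf(lam/sqrt(2))
    return 1.0 - math.erf(lam / math.sqrt(2.0))

def percent_value(p, lo=0.0, hi=10.0, tol=1e-10):
    target = p / 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if two_tail(mid) > target:
            lo = mid            # tail still too heavy: move outward
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(percent_value(50))   # quartile deviation, ~0.6745
print(percent_value(5))    # ~1.96
print(percent_value(1))    # ~2.58
```

The values so obtained reproduce the quartile coefficient 0.6745 and the 5, 1 and 0.1 percent values quoted in the text.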
From the value of λ_p for p = 50, it follows that the quartiles (cf 15.6) of the normal distribution are m ± 0.6745σ. It is further seen that the 5% value of $\frac{\xi - m}{\sigma}$ is about 2.0, the 1% value about 2.6, and the 0.1% value about 3.3. Deviations exceeding four times the standard deviation have extremely small probabilities.

The standardized variable $\frac{\xi - m}{\sigma}$ has the d. f. Φ(x) and consequently, by (17.1.3), the c. f. $e^{-t^2/2}$. It follows from (15.9.2) that the variable ξ has the c. f.

$$(17.2.1) \quad E\!\left(e^{it\xi}\right) = e^{imt - \frac{\sigma^2 t^2}{2}}.$$

From this expression, the semi-invariants are found by (15.10.2), and we obtain

$$(17.2.2) \quad \varkappa_1 = m, \quad \varkappa_2 = \sigma^2, \quad \varkappa_3 = \varkappa_4 = \cdots = 0.$$

The moments about the mean of the variable ξ are

$$(17.2.3) \quad \mu_{2\nu+1} = 0, \quad \mu_{2\nu} = 1 \cdot 3 \cdots (2\nu - 1)\,\sigma^{2\nu}.$$

In particular, the coefficients of skewness and excess (cf 15.8) are γ₁ = γ₂ = 0. Finally we observe that, if the variable ξ is normal (m, σ), it follows from (15.1.1) that any linear function aξ + b is normal (am + b, |a|σ).

17.3. Addition of independent normal variables. — Let ξ₁, ξ₂, ..., ξ_n be independent normally distributed variables, the parameters of ξ_ν being m_ν and σ_ν. Consider the sum ξ = ξ₁ + ··· + ξ_n. Denoting by m and σ the mean and the s. d. of ξ, we then have by (15.12.7)

$$(17.3.1) \quad m = m_1 + m_2 + \cdots + m_n, \qquad \sigma^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2.$$

By the multiplication rule (15.12.1), the c. f. of ξ is the product of the c. f:s of all the ξ_ν. From the expression (17.2.1) for the c. f. of the normal distribution, we obtain

$$E\!\left(e^{it\xi}\right) = \prod_{\nu=1}^{n} e^{im_\nu t - \frac{\sigma_\nu^2 t^2}{2}} = e^{imt - \frac{\sigma^2 t^2}{2}}.$$

This is, however, the c. f. of a normal distribution with the parameters m and σ, and so we have proved the following important addition theorem for the normal distribution: The sum of any number of independent normally distributed variables is itself normally distributed:

$$(17.3.2) \quad \Phi\!\left(\frac{x - m_1}{\sigma_1}\right) * \cdots * \Phi\!\left(\frac{x - m_n}{\sigma_n}\right) = \Phi\!\left(\frac{x - m}{\sigma}\right),$$

where m and σ are given by (17.3.1).

We mention without proof the following converse (Cramér, Ref.
11) of this theorem: If the sum ξ = ξ₁ + ··· + ξ_n of n independent variables is normally distributed, then each component variable ξ_ν is itself normally distributed. Thus it is not only true that the normal distribution reproduces itself by composition, but, moreover, a normal distribution can never be exactly produced by the composition of non-normal components. On the other hand, we shall see in the following paragraph that, under very general conditions, the composition of a large number of non-normal components produces an approximately normal distribution.

Since any linear function of a normal variable is, by the preceding paragraph, itself normal, it follows from (17.3.2) that a linear function a₁ξ₁ + ··· + a_nξ_n + b of independent normal variables is itself normal, with parameters m and σ given by

$$m = a_1 m_1 + \cdots + a_n m_n + b, \qquad \sigma^2 = a_1^2 \sigma_1^2 + \cdots + a_n^2 \sigma_n^2.$$

In particular, we have the important theorem that, if ξ₁, ..., ξ_n are independent and all normal (m, σ), the arithmetic mean $\bar\xi = \frac{1}{n}\sum_1^n \xi_\nu$ is itself normal $\left(m, \frac{\sigma}{\sqrt{n}}\right)$.

17.4. The Central Limit Theorem. — Consider a sum

$$(17.4.1) \quad \xi = \xi_1 + \xi_2 + \cdots + \xi_n$$

of n independent variables, where ξ_ν has the mean m_ν and the s. d. σ_ν. The mean m and the s. d. σ of the sum ξ are then given by the usual expressions (17.3.1). In the preceding paragraph we have seen that, if the ξ_ν are normally distributed, the sum ξ is itself normal. On the other hand, De Moivre's theorem (cf 16.4) shows that, in the particular case when the ξ_ν are variables having the simple distribution (16.1.3), the distribution of the sum is approximately normal for large values of n. In fact, De Moivre's theorem asserts that in this particular case the d. f. of the standardized variable $\frac{\xi - m}{\sigma}$ tends to the normal function Φ(x) as n tends to infinity.

It is a highly remarkable fact that the result thus established by De Moivre's theorem for a special case holds true under much more general circumstances. It will be convenient to introduce the following terminology.
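The parameter rules (17.3.1) for sums of independent normal variables can be put into a small helper. This is our own sketch, not from the book; the function name is hypothetical:

```python
from math import isclose, sqrt

def sum_of_normals(params):
    """params: list of (m_v, sigma_v) for independent normal components.
    Returns (m, sigma) of the normal sum, per (17.3.1):
    means add, variances (not standard deviations) add."""
    m = sum(mv for mv, _ in params)
    sigma = sqrt(sum(sv ** 2 for _, sv in params))
    return m, sigma

# two components with s.d. 3 and 4 give a sum with s.d. 5, not 7
m, s = sum_of_normals([(1.0, 3.0), (2.0, 4.0)])
assert isclose(m, 3.0) and isclose(s, 5.0)
```

The same rule gives the normal $(m, \sigma/\sqrt{n})$ distribution of the arithmetic mean mentioned above, since dividing by n scales each σ_ν by 1/n.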
Generally, if the distribution of a random variable X depends on a parameter n, and if two quantities m₀ and σ₀ (which may or may not depend on n) can be found such that the d. f. of the variable $\frac{X - m_0}{\sigma_0}$ tends to Φ(x) as n → ∞, we shall say that X is asymptotically normal (m₀, σ₀). This does not imply that the mean and the s. d. of X tend to m₀ and σ₀, nor even that these moments exist, but is simply equivalent to saying that we have for any interval (a, b) not depending on n

$$\lim_{n \to \infty} P(m_0 + a\sigma_0 < X < m_0 + b\sigma_0) = \Phi(b) - \Phi(a).$$

Thus e. g. the variable ν considered in De Moivre's theorem is asymptotically normal $(np, \sqrt{npq})$.

The so called Central Limit Theorem in the mathematical theory of probability may now be expressed in the following way: Whatever be the distributions of the independent variables ξ_ν — subject to certain very general conditions — the sum ξ = ξ₁ + ··· + ξ_n is asymptotically normal (m, σ), where m and σ are given by (17.3.1).

This fundamental theorem was first stated by Laplace (Ref. 22) in 1812. A rigorous proof under fairly general conditions was given by Liapounoff (Ref. 146, 147) in 1901. The problem of finding the most general conditions of validity has been solved by Feller, Khintchine and Lévy (Ref. 85, 86, 140, 145). We shall here only prove the theorem in two particular cases that will be sufficient for most statistical applications.

Let us first consider the case of equal components, i. e. the case when all the ξ_ν in (17.4.1) have the same distribution. In this case we have m = nm₁, σ = σ₁√n, and the standardized variable may be written

$$\frac{\xi - m}{\sigma} = \frac{1}{\sigma_1\sqrt{n}} \sum_{\nu=1}^{n} (\xi_\nu - m_1),$$

where all the deviations ξ_ν − m₁ have the same distribution. Denote by φ₁(t) the c. f. of any of these deviations, while F(x) and φ(t) are the d. f. and the c. f. of the standardized variable $\frac{\xi - m}{\sigma}$. It then follows from (15.9.2) and (15.12.1) that we have
$$(17.4.2) \quad \varphi(t) = \left[\varphi_1\!\left(\frac{t}{\sigma_1\sqrt{n}}\right)\right]^n.$$

The two first moments of the variable ξ_ν − m₁ are 0 and σ₁², so that by (10.1.3) we have for the corresponding c. f. the expansion

$$\varphi_1(t) = 1 - \tfrac{1}{2}\sigma_1^2 t^2 + o(t^2).$$

Substituting $\frac{t}{\sigma_1\sqrt{n}}$ for t, we then obtain from (17.4.2)

$$\varphi(t) = \left[1 - \frac{t^2}{2n}\left(1 + \varepsilon\right)\right]^n,$$

where for every fixed t the quantity ε tends to zero as n → ∞. It follows that $\varphi(t) \to e^{-t^2/2}$ for every t, and hence we infer as in 16.4 that the corresponding d. f. F(x) tends to Φ(x) for every x. We thus have the following case of the Central Limit Theorem, first proved by Lindeberg and Lévy (Ref. 24, 148): If ξ₁, ξ₂, ... are independent random variables all having the same probability distribution, and if m₁ and σ₁ denote the mean and the s. d. of every ξ_ν, then the sum $\xi = \sum_1^n \xi_\nu$ is asymptotically normal $(nm_1, \sigma_1\sqrt{n})$. It follows that the arithmetic mean $\bar\xi = \frac{1}{n}\sum_1^n \xi_\nu$ is asymptotically normal $(m_1, \sigma_1/\sqrt{n})$.

In the case of equal components, it is thus sufficient for the validity of the Central Limit Theorem to assume that the common distribution of the ξ_ν has a finite moment of the second order. When we proceed to the general case of variables that are not supposed to be equally distributed it is, however, no longer sufficient to assume that each ξ_ν has a finite second order moment, and thus we have to impose some further conditions. The object of such additional conditions is, generally speaking, to reduce the probability that an individual ξ_ν will yield a relatively large contribution to the total value of the sum. An interesting sufficient condition of this type has been found by Lindeberg. We shall, however, here only give the following somewhat less general theorem due to Liapounoff:

Let ξ₁, ξ₂, ... be independent random variables, and denote by m_ν and σ_ν the mean and the s. d. of ξ_ν. Suppose that the third absolute moment of ξ_ν about its mean is finite for every ν, and write

$$\varrho_\nu^3 = E\!\left(|\xi_\nu - m_\nu|^3\right), \qquad \varrho^3 = \varrho_1^3 + \varrho_2^3 + \cdots + \varrho_n^3.$$

If the condition

$$(17.4.3) \quad \lim_{n \to \infty} \frac{\varrho}{\sigma} = 0$$

is satisfied, then the sum $\xi = \sum_1^n \xi_\nu$ is asymptotically normal (m, σ), where m and σ are given by (17.3.1).
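The Lindeberg-Lévy case can be checked by simulation. The sketch below is a modern illustration of our own (fixed seed, names ours): sums of uniform variables are standardized with $m_1 = \tfrac{1}{2}$, $\sigma_1 = 1/\sqrt{12}$, and the empirical d. f. is compared with Φ:

```python
import random
from math import erf, sqrt

random.seed(1)

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

# sum of n uniform(0,1) variables: m1 = 1/2, sigma1 = 1/sqrt(12)
n, trials = 48, 20000
m1, s1 = 0.5, 1 / sqrt(12)
standardized = [
    (sum(random.random() for _ in range(n)) - n * m1) / (s1 * sqrt(n))
    for _ in range(trials)
]
# empirical d.f. at x = 0.5 should be close to Phi(0.5) ~ 0.69
emp = sum(z <= 0.5 for z in standardized) / trials
assert abs(emp - normal_cdf(0.5)) < 0.02
```

The tolerance 0.02 is several times the Monte-Carlo standard error at 20 000 trials.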
In the particular case when all the ξ_ν are equally distributed, we have ϱ³ = nϱ₁³, σ² = nσ₁², and thus $\frac{\varrho}{\sigma} = \frac{\varrho_1}{\sigma_1 n^{1/6}}$, so that the condition is satisfied. It should not be inferred, however, that the Lindeberg-Lévy theorem proved above is a particular case of the Liapounoff theorem, since the former does not assume the existence of the third moment.

In order to prove the Liapounoff theorem, we denote by φ_ν(t) the c. f. of the ν:th deviation ξ_ν − m_ν, and by φ(t) the c. f. of the standardized sum $\frac{1}{\sigma}\sum_1^n (\xi_\nu - m_\nu)$. From (15.9.2) and (15.12.1) it then follows that we have

$$(17.4.4) \quad \varphi(t) = \prod_{\nu=1}^{n} \varphi_\nu\!\left(\frac{t}{\sigma}\right).$$

As before, it is sufficient to prove that for every fixed t we have $\varphi(t) \to e^{-t^2/2}$ as n → ∞, as the theorem then directly follows from the continuity theorem 10.4. Using the expansion (16.4.4) with k = 3, we obtain

$$\varphi_\nu\!\left(\frac{t}{\sigma}\right) = E\!\left(e^{\frac{it}{\sigma}(\xi_\nu - m_\nu)}\right) = 1 - \frac{\sigma_\nu^2 t^2}{2\sigma^2} + \vartheta\,\frac{\varrho_\nu^3 |t|^3}{6\sigma^3} = 1 + s_\nu,$$

where, as in 16.4, we use ϑ as a general notation for a quantity of modulus not exceeding unity. Observing that by (15.4.6) we have σ_ν ≤ ϱ_ν for every ν, the condition (17.4.3) now shows that for every fixed t we have s_ν → 0, uniformly in ν, as n → ∞. Thus certainly |s_ν| < ½ for all sufficiently large n. For |s| < ½ we have, however,

$$\log(1 + s) = s - \frac{s^2}{2} + \frac{s^3}{3} - \cdots = s + 2\vartheta|s|^2,$$

and hence, summing over ν = 1, 2, ..., n, we now obtain by (17.4.4)

$$\log \varphi(t) = -\frac{t^2}{2} + \vartheta\left(\frac{\varrho^3}{6\sigma^3}|t|^3 + 2\sum_{\nu=1}^{n}|s_\nu|^2\right),$$

where the remainder terms tend to zero by (17.4.3). As n tends to infinity, it thus follows that log φ(t) tends to $-\frac{t^2}{2}$ for every fixed t, and thus the Liapounoff theorem is proved.

In the case (cf 16.6) of the variable $\nu = \sum_1^n \xi_r$, which expresses the number of successes in a series of n independent trials with the probabilities p₁, ..., p_n, we have

$$\varrho_r^3 = E\!\left(|\xi_r - p_r|^3\right) = p_r q_r (p_r^2 + q_r^2) \le p_r q_r, \qquad \sigma^2 = \sum_{r=1}^{n} p_r q_r.$$
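The behaviour of the Liapounoff ratio ϱ/σ for Bernoulli trials can be computed directly. This is our own numerical sketch (names ours) of the two regimes discussed next: bounded probabilities make the ratio shrink, while rapidly vanishing probabilities do not:

```python
from math import sqrt

def liapounoff_ratio(ps):
    # rho / sigma for independent trials with success probabilities ps,
    # using rho_r^3 = p_r q_r (p_r^2 + q_r^2) and sigma^2 = sum p_r q_r
    rho3 = sum(p * (1 - p) * (p**2 + (1 - p)**2) for p in ps)
    sigma2 = sum(p * (1 - p) for p in ps)
    return rho3 ** (1 / 3) / sqrt(sigma2)

# bounded probabilities: sum p_r q_r diverges, ratio decreases like n^(-1/6)
bounded = [0.3 + 0.4 * (r % 2) for r in range(10000)]
assert liapounoff_ratio(bounded[:100]) > liapounoff_ratio(bounded)
assert liapounoff_ratio(bounded) < 0.25

# p_r = 2^-r: sum p_r q_r converges, and the ratio does not tend to zero
shrinking = [2.0 ** -r for r in range(1, 60)]
assert liapounoff_ratio(shrinking) > 0.5
```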
If the series $\sum_1^\infty p_r q_r$ is divergent, the Liapounoff condition (17.4.3) is satisfied, and thus the variable ν is asymptotically normal $\left(\sum_1^n p_r,\ \sqrt{\sum_1^n p_r q_r}\right)$. A sufficient condition for the divergence of $\sum p_r q_r$ is, e. g., that a number c > 0 can be found such that c < p_r < 1 − c for all r. If, on the other hand, $\sum p_r q_r$ is convergent, it can be proved (Ref. 11) that the variable ν is not asymptotically normal.

17.5. Complementary remarks to the Central Limit Theorem. — The Central Limit Theorem has been modified and extended in various directions. In this paragraph, we shall give a few brief remarks on some of these questions, while the following paragraphs will be devoted to a particular problem belonging to the same order of ideas.

1. The theorems of the preceding paragraph are exclusively concerned with the distribution functions of the variables. It is the d. f. of the standardized sum $\frac{\xi - m}{\sigma}$ that is shown to tend to the normal d. f. If the component variables ξ_ν all belong to the continuous type, the question arises if the frequency function of $\frac{\xi - m}{\sigma}$ tends to the normal fr. f. $\Phi'(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. It can, in fact, be shown (Cramér, Ref. 11, 70) that this is true if certain general regularity conditions are imposed on the components (cf 17.7.4).

2. In problems of theoretical statistics it often occurs that we are concerned with a function g(ξ₁, ..., ξ_n) of n independent random variables, where n may be considered as a large number. If the function g has continuous derivatives of the first and second orders in the neighbourhood of the point (m₁, ..., m_n), where m_ν denotes the mean of ξ_ν, we may write a Taylor expansion

$$(17.5.1) \quad g(\xi_1, \ldots, \xi_n) = g(m_1, \ldots, m_n) + \sum_{\nu=1}^{n} c_\nu (\xi_\nu - m_\nu) + R,$$

where c_ν is the value of $\frac{\partial g}{\partial \xi_\nu}$ in the point (m₁, ..., m_n), while the remainder R contains derivatives of the second order. The first term on the right hand side is a constant, while the second term is the sum of n independent random variables, each having the mean zero.
By the central limit theorem we can then say that, under general conditions, the sum of the two first terms is asymptotically normal, with a mean equal to the first term. In many important cases it is possible to show that, in the limit as n → ∞, the presence of the term R has no influence on the distribution, so that the function g is, for large values of n, approximately normally distributed (cf von Mises, Ref. 157, 158). We shall return to this question in Ch. 28.

3. The central limit theorem may be extended to various cases when the variables in the sum are not independent. We shall here only indicate one of these extensions (Cramér, Ref. 10, p. 145), which has a considerable importance for various applications, especially to biological problems. For further information, the reader may be referred to a book by Lévy (Ref. 25), and to papers by Bernstein, Kapteyn and Wicksell (Ref. 63, 135, 230).

It will be convenient to use here a terminology directly connected with some of the biological applications. If our random variable is the size of some specified organ that we are observing, the actual size of this organ in a particular individual may often be regarded as the joint effect of a large number of mutually independent causes, acting in an ordered sequence during the time of growth of the individual. If these causes simply add their effects, which are assumed to be random variables, we infer by the central limit theorem that the sum is asymptotically normally distributed. In general it does not, however, seem plausible that the causes co-operate by simple addition. It seems more natural to suppose that each cause gives an impulse, the effect of which depends both on the strength of the impulse and on the size of the organ already attained at the instant when the impulse is working. Suppose that we have n impulses ξ₁, ..., ξ_n, acting in the order of their indices. These we consider as independent random variables.
Denote by x_i the size of the organ which is produced by the impulses ξ₁, ..., ξ_i. We may then suppose e. g. that the increase caused by the impulse ξ_{i+1} is proportional to ξ_{i+1} and to some function g(x_i) of the momentary size of the organ:

$$(17.5.2) \quad x_{i+1} - x_i = \xi_{i+1}\, g(x_i).$$

It follows that we have

$$\sum_{i=1}^{n} \xi_i = \sum_{i=0}^{n-1} \frac{x_{i+1} - x_i}{g(x_i)}.$$

If each impulse only gives a slight contribution to the growth of the organ, we thus have approximately

$$\xi_1 + \xi_2 + \cdots + \xi_n \approx \int_{x_0}^{x} \frac{dt}{g(t)},$$

where x = x_n denotes the final size of the organ. By hypothesis the ξ_ν are independent variables, and n may be considered as a large number. Under the general regularity conditions of the central limit theorem it thus follows that, in the limit, the function of the random variable x appearing in the second member is normally distributed.

Consider, e. g., the case g(t) = t. The effect of each impulse is then directly proportional to the momentary size of the organ. In this case we thus find that log x is normally distributed. If, more generally, log(x − a) is normal (m, σ), it is easily seen that the variable x itself has the fr. f.

$$(17.5.3) \quad \frac{1}{\sigma(x - a)\sqrt{2\pi}}\, e^{-\frac{(\log(x - a) - m)^2}{2\sigma^2}}$$

for x > a, while for x ≤ a the fr. f. is zero. The corresponding frequency curve, which is unimodal and of positive skewness, is illustrated in Fig. 17. This logarithmico-normal distribution may be used as the basic function of expansions in series, analogous to those derived from the normal distribution, which are discussed in the following paragraphs.

Similar arguments may be applied also in other cases, e. g. in certain branches of economic statistics. Consider the distribution of incomes or property values in a certain population. The position of an individual on the property scale might be regarded as the effect of a large number of impulses, each of which causes a certain increase of his wealth.
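The fr. f. (17.5.3) is easy to evaluate directly. The sketch below is our own modern illustration (names ours); a crude Riemann sum checks that the density has total mass 1:

```python
from math import exp, log, pi, sqrt

def lognormal_pdf(x, a=0.0, m=0.0, sigma=1.0):
    # fr.f. (17.5.3): log(x - a) is normal (m, sigma); zero for x <= a
    if x <= a:
        return 0.0
    return exp(-(log(x - a) - m) ** 2 / (2 * sigma ** 2)) / (
        sigma * (x - a) * sqrt(2 * pi)
    )

# crude numeric check that the density integrates to ~1 (a = 0, m = 0, sigma = 0.4)
h = 0.001
total = sum(lognormal_pdf(k * h, sigma=0.4) for k in range(1, 20000)) * h
assert abs(total - 1.0) < 0.01
assert lognormal_pdf(-1.0) == 0.0
```

The truncation at x = 20 is harmless for σ = 0.4, since the upper tail beyond that point is negligible.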
It might be argued that the effect of such an impulse would not unreasonably be expected to be proportional to the wealth already attained. If this argument is accepted, we should expect distributions of incomes or property values to be approximately logarithmico-normal. For low values of the income, the logarithmico-normal curve seems, in fact, to agree fairly well with actual income curves (Quensel, Ref. 201, 202). For moderate and large incomes, however, the Pareto distribution discussed in 19.3 generally seems to give a better fit.

Fig. 17. The logarithmico-normal distribution (17.5.3): frequency curve.

17.6. Orthogonal expansion derived from the normal distribution. — Consider a random variable ξ which is the sum

$$(17.6.1) \quad \xi = \xi_1 + \xi_2 + \cdots + \xi_n$$

of n independent random variables. Under the conditions of the central limit theorem, the d. f. F(x) of the standardized variable $\frac{\xi - m}{\sigma}$ is for large n approximately equal to Φ(x). Further, if all the components ξ_ν have distributions of the continuous type, the fr. f. f(x) = F′(x) will (cf 17.5) under certain general regularity conditions be approximately equal to the normal fr. f.¹) φ(x) = Φ′(x). Writing

$$(17.6.2) \quad F(x) = \Phi(x) + R(x), \qquad f(x) = \varphi(x) + r(x),$$

this implies that R(x) and r(x) = R′(x) are small for large values of n, so that Φ(x) and φ(x) may be regarded as first approximations to F(x) and f(x) respectively. It is then natural to ask if, by further analysis of the remainder terms R(x) and r(x), we can find more accurate approximations, e. g. in the form of some expansion of R(x) and r(x) in series.

¹) As a rule we use the letter φ to denote a characteristic function. In the paragraphs 17.6 and 17.7, however, φ(x) will denote the normal frequency function $\varphi(x) = \Phi'(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, while the letter ψ will be used for c. f:s.
In the applications, we often encounter fr. f:s and d. f:s which are approximately normal, even in cases where there is no reason to assume that the corresponding random variable is gener- ated in the form (17.6.1), as a sum of independent variables. It is then natural to write these functions in the form (17.6.2), and to try to find some convenient expansion for the remainder terms. We shall here discuss two different types of such expansions. In the present paragraph, we shall be concerned with the expansion in orthogonal polynomials known as the Gram-Charlier series of type A (Ref. 9, 65, IIH), while the following paragraph will be devoted to the asymptotic expansion introduced by Edgeworth. In both cases we shall have to content ourselves witli some formal developments and some brief indications of the main results obtained, as the complete proofs are rather complicated. Let us first consider any random variable ^ with a distribution of the continuous type, without assuming that there is a representation the form (17.6.1). As usual we denote the mean and the s. d. of £ by m and a, while fx, denotes the r.th order central moment (cf 15.4) of £, which is supposed to be finite for all r. We shall consider the stamhirdised variablr ^ and denote its d. f. and fr. f. by F [.r] and /(./■) -F'[xl For any fr, f. /(.r), we may consider an expansion of the form (1 7.(i :?) /■(,.■) -- <p {,■) + 9 -/ (.r) + y/' (.r) + where the c, are constant coefficients. According to (12.6.4), we have ^(i)(,^.)^( — 1)' //,,(./;) rjr (.;■), where Hv(x) is the Hermite polynomial of degree and thus (17.6.3) is in reality an expansion in orthogonal imlynomials of tlie type (12.6.2). We shall now determine the coeffi- cients in the same way as in 12.6, assuming that the series may be integrated term by term. Multiplying with jHy(x) and integrating, we directly obtain from the orthogonality relations (12.6.6) CO (17.6.4) r, - (- 1)’ / 7/,.(.r)/(.r)</,r. - oo £ — vi Now is the fr. f. 
of the standardized variable $\frac{\xi - m}{\sigma}$, which has zero mean and unit s. d., while its ν:th moment is $\frac{\mu_\nu}{\sigma^\nu}$. Accordingly we find c₀ = 1, c₁ = c₂ = 0, so that the development (17.6.3), and the development obtained by formal integration, may be written

$$(17.6.5) \quad F(x) = \Phi(x) + \frac{c_3}{3!}\,\Phi^{(3)}(x) + \frac{c_4}{4!}\,\Phi^{(4)}(x) + \cdots, \qquad f(x) = \varphi(x) + \frac{c_3}{3!}\,\varphi^{(3)}(x) + \frac{c_4}{4!}\,\varphi^{(4)}(x) + \cdots,$$

where the c_ν are given by (17.6.4). From the expressions (12.6.5) of the first Hermite polynomials, we obtain in particular, denoting by γ₁ and γ₂ the coefficients of skewness and excess (cf 15.8) of the variable ξ,

$$(17.6.6) \quad c_3 = -\gamma_1 = -\frac{\mu_3}{\sigma^3}, \quad c_4 = \gamma_2 = \frac{\mu_4}{\sigma^4} - 3, \quad c_5 = -\left(\frac{\mu_5}{\sigma^5} - 10\,\frac{\mu_3}{\sigma^3}\right), \ldots$$

With any standardized variable having finite moments of all orders, we may thus formally associate the expansions (17.6.5), the coefficients of which are given by (17.6.4). But do these expansions really converge and represent f(x) and F(x)? It can in fact be shown (cf e. g. Cramér, Ref. 11, 70) that, whenever the integral

$$(17.6.6\,a) \quad \int_{-\infty}^{\infty} e^{x^2/4}\, dF(x)$$

is convergent, the first series (17.6.5) will converge for every x to the sum F(x). If, in addition, the fr. f. f(x) is of bounded variation in (−∞, ∞), the second series (17.6.5) will converge to f(x) in every continuity point of f(x). On the other hand, it can be shown by examples (cf Ex. 18, p. 258) that, if these conditions are not satisfied, the expansions may be divergent. Thus it is in reality only for a comparatively small class of distributions that we can assert the validity of the expansions (17.6.5). In fact, the majority of the important distributions treated in the two following chapters are not included in this class.

However, in practical applications it is in most cases only of little value to know the convergence properties of our expansions. What we really want to know is whether a small number of terms — usually not more than two or three — suffice to give a good approximation to f(x) and F(x).
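How much a few terms help can be tried on a concrete case. The sketch below is our own modern illustration (names ours, not Cramér's): the four-term A-series with c₃ = −γ₁, c₄ = γ₂ from (17.6.6) is applied to the standardized sum of 10 exponential variables, whose exact density is known from the gamma distribution:

```python
from math import exp, gamma, pi, sqrt

def phi(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def a_series(x, g1, g2):
    # truncated A-series: c0 = 1, c1 = c2 = 0, c3 = -gamma1, c4 = gamma2
    d3 = -(x**3 - 3 * x) * phi(x)        # phi'''(x) via Hermite polynomials
    d4 = (x**4 - 6 * x**2 + 3) * phi(x)  # phi''''(x)
    return phi(x) - g1 / 6 * d3 + g2 / 24 * d4

# standardized sum of n Exp(1) variables: gamma1 = 2/sqrt(n), gamma2 = 6/n;
# exact density is sqrt(n) times the Gamma(n,1) pdf at n + x*sqrt(n)
n, x = 10, 0.5
s = n + x * sqrt(n)
exact = sqrt(n) * s ** (n - 1) * exp(-s) / gamma(n)
approx = a_series(x, 2 / sqrt(n), 6 / n)
assert abs(approx - exact) < 0.02
assert abs(approx - exact) < abs(phi(x) - exact)
```

Already at n = 10 the two correction terms cut the error of the plain normal approximation by a large factor.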
If we know this to be the case, it does not concern us much whether the infinite series is convergent or divergent. And conversely, if we know that one of the series (17.6.5) is convergent, this knowledge is of little practical value if it will be necessary to calculate a large number of the coefficients c_ν in order to have the sum of the series determined to a reasonable approximation.

It is particularly when we are dealing with a variable ξ generated in the form (17.6.1) that the question thus indicated becomes important. As pointed out above, we know that under certain general conditions F(x) and f(x) are approximately equal to Φ(x) and φ(x) when n is large. Will the approximation be improved if we include the term involving the third derivative in (17.6.5)? And will the consideration of further terms of the expansions yield a still better approximation? It will be seen that we are here in reality concerned with a question relating to the asymptotic properties of our expansions for large values of n.

In order to simplify the algebraical calculations, we shall consider the case of equal components (cf 17.4), when all the components ξ_ν in (17.6.1) have the same distribution, with the mean m₁ and the s. d. σ₁, so that we have m = nm₁, σ = σ₁√n. In this case, we now propose to study the behaviour of the coefficients c_ν of the A-series for large values of n.

Let ψ(t) denote the c. f. of the standardized sum $\frac{\xi - m}{\sigma}$, while ψ₁(t) is the c. f. of the deviation ξ₁ − m₁. According to (17.4.2) we then have

$$\psi(t) = \left[\psi_1\!\left(\frac{t}{\sigma_1\sqrt{n}}\right)\right]^n.$$

For ν = 1, 2, ..., let ϰ_ν denote the semi-invariants of $\xi = \sum_1^n \xi_\nu$, while ϰ′_ν are the semi-invariants of ξ₁ − m₁, and put
It should be observed that we cannot in general assert that the power series in the second member is convergent, but only that it holds as an asymptotic expansion for small values of t in the same sense as (10.1.3). If we compare (17.6.10) with the expansion ( 1 7 .6. 1 1 ) /(a?) = (r) + g , (.r) + (r) + . . . , it will be seen that the terms of the two expansions correspond by means of the following relation obtained from (10.5.5): " oo __ /* (17.6.12) J (x) dx = ( — itye , (v = 0, 1, 2, . . .). — oo As remarked in an analogous case in 15.10, we may use power series of the type (17.6.9) in a purely formal way, without paying any attention to questions of convergence) as long as we are only concerned with the deduction of the algebraic relations between the various parameters, such as the Cr and the Ar . Thus we may write, in accord- ance with 15.10 and using (17.6.7), 225 15 — 464 H. Cramir 17.6 '(p{t) = Now has the mean zero and the s. d. cfi. Thus -/i = 0 and '^2 = o ] , so that = 0 and = 1 . Hence we may write the last relation (17.6.13) In order to obtain an explicit expression for c, in terms of the A!! , it now only remains to develop this expression in powers of t, and iden- tify the resulting series with (17.6.9). In this way we obtain (17.6.14) Cs = As n'* a; H a:, „«/. A'o 10 A? and generally 2 which shows that r, is of the form (17.6.15) c, ^ ttv i ll -r where [ia/ 3J denotes the greatest integer ^r/3, while the a^h are poly- nomials in the A', , which are independent of v. Thus 226 17.6-7 as n tends to infinity. The following table shows the order of magni- tude of Cr for the first values of v. Subscript v 3 4 , 6 5, 7, 9 8 , 10 , 12 11, 13, 15 Thus the order of magnitude of the terms of the mi series is not steadily decreasing as v increases. Suppose, e. g., that we want to calculate a partial sum of the series (17.6.11), taking account of all terms involving corrections to q>{x) of order or . 
It then follows from the table that we must consider the terms up to v = 6 inclusive. In order to calculate the coefficients Cv of these terms according to (17.6.6) or (17.6.14), we shall require the moments /Xv or the semi-invariants K up to the sixth order. An inspection of (17.6.14) shows, however, that the contributions of order and n' ^ really do not contain any semi-invariants of order higher than the fourth, so that in reality it ought not to be necessary to go beyond this order. If we want to proceed further and include terms containing the factors etc., it is easily seen that we shall encounter precisely similar inadequacies. Thus the Grain-Oharlier A -series cannot be considered as a satis- factory solution of the expansion problem for F(x) and f{x). We want, in fact, a series which gives a straightforward expansion in powers of and is such that the calculation of the terms up to a certain order of magnitude does not require the knowledge of any moments or semi-invariants that are not really necessary. These con- ditions are satisfied by Edgeworth’s series, which will be treated in the following paragraph. 17.7. Asymptotic expansion derived from the normal distribution. — In the preceding paragraph, the expansion of the function (17.7.1) 1/^(0 in powers of t furnished expressions of the coefficients Cv in the A- 227 17.7 series. The same function (17.7.1) can however, also be expanded in a difiEerent way, viz. in powers of Writing {ity Iff (t) e ' \v+2y 1 ' ^ h\ \‘^\v + 2)! *=o L<=i we obtain after development + hr, r + 2 (i ,. + 4 (t + + h.,^r{iiY' -C ^ ^ where hv,v+ 2 h is a polynomial in As, . . ., which is independent of n. By the integral relation (17.6.12), this corresponds to the ex- pansion in powers of (17.7.2) f[x) = 9,(x) + 2(- 1)- ^ the first terms of which are, writing all terms of a certain order with respect to 7i on the same line, f[x)=q)(x) 1 Al 10 A? 1 . 35 AU; . 280 As* ,.w . 
By (17.6.7) and (17.6.8) the coefficients may be expressed in terms of the semi-invariants ϰ_ν, which in their turn may be replaced by the central moments μ_ν by means of (15.10.5). In this way we obtain the series introduced by Edgeworth (Ref. 80):

$$(17.7.3) \quad f(x) = \varphi(x) - \frac{\mu_3}{3!\,\sigma^3}\,\varphi^{(3)}(x) + \left[\frac{\mu_4 - 3\sigma^4}{4!\,\sigma^4}\,\varphi^{(4)}(x) + \frac{10\,\mu_3^2}{6!\,\sigma^6}\,\varphi^{(6)}(x)\right] - \cdots,$$

where the terms within each successive group are of the same order of magnitude. In order to obtain a corresponding expansion for the d. f. F(x) we have only to replace φ(x) by Φ(x).

The asymptotic properties of these series have been investigated by Cramér (Ref. 11, 70), who has shown that, under fairly general conditions, the series (17.7.2) really gives an asymptotic expansion of f(x) in powers of $n^{-1/2}$, with a remainder term of the same order as the first term neglected. Analogous results hold true for F(x). If we consider only the first term of the series, it follows in particular that we have in these cases

$$(17.7.4) \quad |F(x) - \Phi(x)| < \frac{A}{\sqrt{n}}, \qquad |f(x) - \varphi(x)| < \frac{B}{\sqrt{n}},$$

where A and B are constants.¹)

The terms of order $n^{-\nu/2}$ in Edgeworth's series contain the moments μ₃, ..., μ_{ν+2}, which are precisely the moments necessarily required for an approximation to this order. In practice it is usually not advisable to go beyond the third and fourth moments. The terms containing these moments will, however, often be found to give a good approximation to the distribution. For the numerical calculations, tables of the derivatives φ^{(ν)}(x) will be required. These are given in Table 1, p. 557. Introducing the coefficients γ₁ and γ₂ of skewness and excess (cf 15.8), we may write the expression for f(x) up to terms of order $n^{-1}$:

$$(17.7.5) \quad f(x) = \varphi(x) - \frac{\gamma_1}{3!}\,\varphi^{(3)}(x) + \frac{\gamma_2}{4!}\,\varphi^{(4)}(x) + \frac{10\,\gamma_1^2}{6!}\,\varphi^{(6)}(x).$$

¹) It has been shown by Esseen (Ref. 83) and Bergström (Ref. 62) that the inequality for |F − Φ| holds under the sole condition that the third absolute moment is finite.

Diagrams of the derivatives φ^{(3)}, φ^{(4)} and φ^{(6)}, with the numerical coefficients appearing in (17.7.5), are shown in Fig.
18. The curves for φ^{(4)} and φ^{(6)} are symmetric about x = 0, while the third derivative introduces an asymmetric element into the expression.

For large x, the expression (17.7.5) will sometimes yield small negative values for f(x). This is, of course, quite consistent with the fact that (17.7.5) gives an approximate, but not an exact, expression for the frequency function.

For the mode x₀ of the fr. f., we obtain from (17.7.5) the approximate expression $x_0 = -\frac{1}{2}\gamma_1$, which is Charlier's measure of skewness. We further have

$$\frac{f(0) - \varphi(0)}{\varphi(0)} = \frac{1}{8}\gamma_2 - \frac{5}{24}\gamma_1^2.$$

The first member represents the relative excess of the frequency curve y = f(x) over the normal curve y = φ(x) at the point x = 0.¹) For

¹) If, instead of comparing the ordinates in the mean x = 0, we compare the ordinates in the modes of the two curves, we obtain in the first approximation the same quantity.
Laplace gave, moreover, the first (incomplete) stafonient of the general theorem studied above under the name of the Central Limit Theorem, and made a great number of important applications of the normal distribution to various questions in the theory of proba- i)ility. Under the influence of the great works of Gauss and Laplace, it was for a long time mor(* or less regarded as an axiom that statistical distributions of [)ractically all kinds would a]>proach the normal dis> tribution as an ideal limiting form, if only we could dispose of a sufficiently large number of sufficiently accurate observations. The deviation of any random variable from its mean was regarded as an error ^ subject to the »law of errors» expressed by the normal distribution. Even if this view was definitely exaggerated and has had to be con- siderably modified, it is undeniable that, in a large number of im- portant applications, we meet distributions which are at least approxi- mately normal. Such is the case, e. g., with the distributions of errors of physical and astronomical measurements, a greai number of demo- graphical and biological distributions, etc. The central limit theorem affords a theoretical explanation of these empirical facts. According to the » hypothesis of elementary errors introduced by Hagen and Bessel, the total error committed at a physi- <^al or astronomical measurement is regarded as the sum of a large number of mutually independent elementary errors. By the central 231 17.8 limit theorem, the total error should then be approximately normally distributed. — In a similar way, it often seems reasonable to regard a random variable observed e. g. in some biological investigation as being the total effect of a large number of independent causes, which sum up their effects. The same ])oint of view may be applied to the variables occurring in many technical and economical questions. 
Thus the total consumption of electric energy delivered by a certain producer is the sum of the quantities consumed by the various customers, the total gain or loss on the risk business of an insurance company is the sum of the gains or losses on each single policy, etc. In cases of this character, we should expect to find at least approximately normal distributions. If the number of components is not sufficiently large, or if the various components cannot be regarded as strictly additive and independent, the modifications of the central limit theorem indicated in 17.5—17.7 may still show that the distribution is approximately normal, or they may indicate the use of some distribution closely related to the normal, such as the asymptotic expansion (17.7.3) or the logarithmico-normal distribution (17.5.3).

Under the conditions of the central limit theorem, the arithmetic mean of a large number of independent variables is approximately normally distributed. The remarks made in connection with (17.5.1) imply that this property holds true even for certain functions of a more general character than the mean. These properties are of a fundamental importance for many methods used in statistical practice, where we are largely concerned with means and other similar functions of the observed values of random variables (cf Ch. 28).

There is a famous remark by Lippmann (quoted by Poincaré, Ref. 31) to the effect that »everybody believes in the law of errors, the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact». — It seems appropriate to comment that both parties are perfectly right, provided that their belief is not too absolute: mathematical proof tells us that, under certain qualifying conditions, we are justified in expecting a normal distribution, while statistical experience shows that, in fact, distributions are often approximately normal.

CHAPTER 18.
Various Distributions Related to the Normal.

In this chapter, we shall consider the distributions of some simple functions of normally distributed variables. All these distributions have important statistical applications, and will reappear in various connections in Part III.

18.1. The \chi^2-distribution. — Let \xi be a random variable which is normal (0, 1). The fr. f. of the square \xi^2 is, by (15.1.4), equal to

\frac{1}{\sqrt{2\pi x}}\; e^{-\frac{x}{2}}

for x > 0. For x \le 0, the fr. f. is zero. The c. f. corresponding to this fr. f. is obtained by putting \alpha = \lambda = \tfrac{1}{2} in (12.3.4), and is

\int_0^\infty \frac{e^{itx}}{\sqrt{2\pi x}}\; e^{-\frac{x}{2}}\, dx = (1 - 2it)^{-\frac{1}{2}}.

Let now \xi_1, \ldots, \xi_n be n independent random variables, each of which is normal (0, 1), and consider the variable

(18.1.1) \chi^2 = \sum_1^n \xi_i^2.

Each \xi_i^2 has the c. f. (1 - 2it)^{-\frac{1}{2}}, and thus by the multiplication theorem (15.12.1) the sum \chi^2 has the c. f.

(18.1.2) E\left(e^{it\chi^2}\right) = (1 - 2it)^{-\frac{n}{2}}.

This is, however, the c. f. obtained by putting \alpha = \tfrac{1}{2}, \lambda = \tfrac{n}{2} in (12.3.4), and the corresponding distribution is thus defined by the fr. f. f(x; \tfrac{1}{2}, \tfrac{n}{2}) as given by (12.3.3). We shall introduce a particular notation for this fr. f., writing for any n = 1, 2, \ldots

(18.1.3) k_n(x) = \begin{cases} \dfrac{1}{2^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2}\right)}\; x^{\frac{n}{2}-1} e^{-\frac{x}{2}} & \text{for } x > 0, \\ 0 & \text{for } x \le 0. \end{cases}

Thus k_n(x) is the fr. f. of the variable \chi^2, so that we have k_n(x)\,dx = P(x < \chi^2 < x + dx). The corresponding d. f. is zero for x \le 0, while for x > 0 it is

(18.1.4) K_n(x) = P(\chi^2 \le x) = \frac{1}{2^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2}\right)} \int_0^x t^{\frac{n}{2}-1} e^{-\frac{t}{2}}\, dt.

The distribution defined by the fr. f. k_n(x) or the d. f. K_n(x) is known as the \chi^2-distribution, a name referring to an important statistical application of the distribution. This will be treated in Ch. 30. The distribution contains a parameter n, which is often denoted as the number of degrees of freedom of the distribution. The meaning of this term will be explained in Ch. 29. The \chi^2-distribution was first found by Helmert (Ref. 125) and K. Pearson (Ref. 183).

For n \le 2, the fr. f. k_n(x) is steadily decreasing for x > 0, while for n > 2 there is a unique maximum at the point x = n - 2.
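As a numerical check of the definition (18.1.1), the following sketch (illustrative only; the degrees of freedom and sample size are arbitrary choices) simulates \chi^2 as a sum of n squared normal (0, 1) deviates; the sample mean and variance should then be close to n and 2n, cf. (18.1.6).

```python
import random

# Illustrative check of (18.1.1): chi^2 = sum of n squared independent
# normal (0, 1) variables, with mean n and variance 2n, cf. (18.1.6).
random.seed(2)
n, reps = 6, 40000

def chi2():
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

xs = [chi2() for _ in range(reps)]
mean = sum(xs) / reps
var = sum((x - mean) ** 2 for x in xs) / reps
print(round(mean, 1), round(var, 1))
```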
Diagrams of the function k_n(x) are shown for some values of n in Fig. 19. The moments \alpha_\nu and the semi-invariants \varkappa_\nu of the \chi^2-distribution are finite for all \nu, and their general expressions may be obtained e. g. from the c. f. (18.1.2), using the formulae in 10.1 and 15.10:

(18.1.5) \alpha_\nu = n(n+2)\cdots(n+2\nu-2), \qquad \varkappa_\nu = 2^{\nu-1}\,(\nu-1)!\; n.

Hence in particular

(18.1.6) E(\chi^2) = \alpha_1 = n, \qquad D^2(\chi^2) = \alpha_2 - \alpha_1^2 = 2n.

Let \chi_1^2 and \chi_2^2 be two independent variables distributed according to (18.1.4) with the values n_1 and n_2 of the parameter. The expression (18.1.2) of the c. f. of the \chi^2-distribution then shows that the c. f. of the sum \chi_1^2 + \chi_2^2 is

(1 - 2it)^{-\frac{n_1}{2}} \cdot (1 - 2it)^{-\frac{n_2}{2}} = (1 - 2it)^{-\frac{n_1+n_2}{2}}.

Thus the \chi^2-distribution, like the binomial, the Poisson and the normal, reproduces itself by composition, and we have the addition theorem:

(18.1.7) K_{n_1}(x) * K_{n_2}(x) = K_{n_1+n_2}(x).

This may, in fact, be regarded as an evident consequence of the definition (18.1.1) of the variable \chi^2, since the sum \chi_1^2 + \chi_2^2 is the sum of n_1 + n_2 independent squares.

Extensive tables of the \chi^2-distribution are available (Ref. 262, 204, 205). In many applications, it is important to find the probability P that the variable \chi^2 assumes a value exceeding a given quantity \chi_0^2. This probability is equal to the area of the tail of the frequency curve situated to the right of an ordinate through the point x = \chi_0^2. Thus

P = P(\chi^2 > \chi_0^2) = \int_{\chi_0^2}^\infty k_n(x)\, dx = 1 - K_n(\chi_0^2).

Usually it is most convenient to tabulate \chi_0^2 as a function of the probability P. When P is expressed in percent, say P = p/100, the corresponding \chi_0^2 = \chi_p^2 is called the p percent value of \chi^2 for n degrees of freedom. Some numerical values of this function are given in Table 3, p. 559.

We shall now give some simple transformations of the \chi^2-distribution that are often required in the applications. If each of the independent variables \xi_1, \ldots, \xi_n is normal (0, \sigma), where \sigma > 0 is an arbitrary constant, the variables \xi_1/\sigma, \ldots, \xi_n/\sigma are independent and normal (0, 1).
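The addition theorem (18.1.7) may also be checked empirically. In the sketch below (illustrative values n_1 = 3, n_2 = 5), the sum of two independent \chi^2 variables is simulated directly, and its first three sample semi-invariants are compared with \varkappa_1 = n, \varkappa_2 = 2n, \varkappa_3 = 8n from (18.1.5) for n = n_1 + n_2.

```python
import random

# Illustration of the addition theorem (18.1.7): chi^2 with n1 d.f. plus
# an independent chi^2 with n2 d.f. behaves as chi^2 with n = n1 + n2 d.f.
# By (18.1.5) the semi-invariants are kappa_1 = n, kappa_2 = 2n, kappa_3 = 8n.
random.seed(3)

def chi2(n):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

n1, n2, reps = 3, 5, 60000
sums = [chi2(n1) + chi2(n2) for _ in range(reps)]
n = n1 + n2
k1 = sum(sums) / reps
k2 = sum((x - k1) ** 2 for x in sums) / reps
k3 = sum((x - k1) ** 3 for x in sums) / reps
print(round(k1 / n, 2), round(k2 / (2 * n), 2), round(k3 / (8 * n), 2))
```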
Thus according to the above the fr. f. of the variable \sum_1^n (\xi_i/\sigma)^2 is equal to k_n(x). Then by (15.1.2) the fr. f. of the variable \sum_1^n \xi_i^2 is

(18.1.8) \frac{1}{\sigma^2}\; k_n\!\left(\frac{x}{\sigma^2}\right) = \frac{1}{2^{\frac{n}{2}}\,\sigma^n\,\Gamma\!\left(\frac{n}{2}\right)}\; x^{\frac{n}{2}-1} e^{-\frac{x}{2\sigma^2}} \qquad (x > 0).

By similar easy transformations, we find the fr. f:s of the arithmetic mean \frac{1}{n}\sum_1^n \xi_i^2, the non-negative square root \chi = \sqrt{\sum_1^n \xi_i^2}, and the square root of the arithmetic mean, \sqrt{\frac{1}{n}\sum_1^n \xi_i^2}. The results are shown in the following table, where \xi_1, \ldots, \xi_n are throughout supposed to be independent and normal (0, \sigma). For x < 0, the fr. f:s are all equal to zero.

Variable. — Frequency function (x > 0).

\sum_1^n \xi_i^2: \qquad \dfrac{1}{2^{\frac{n}{2}}\,\sigma^n\,\Gamma\left(\frac{n}{2}\right)}\; x^{\frac{n}{2}-1} e^{-\frac{x}{2\sigma^2}}

\dfrac{1}{n}\sum_1^n \xi_i^2: \qquad \dfrac{n^{\frac{n}{2}}}{2^{\frac{n}{2}}\,\sigma^n\,\Gamma\left(\frac{n}{2}\right)}\; x^{\frac{n}{2}-1} e^{-\frac{nx}{2\sigma^2}}

\chi = \sqrt{\sum_1^n \xi_i^2}: \qquad \dfrac{2}{2^{\frac{n}{2}}\,\sigma^n\,\Gamma\left(\frac{n}{2}\right)}\; x^{n-1} e^{-\frac{x^2}{2\sigma^2}}

\sqrt{\dfrac{1}{n}\sum_1^n \xi_i^2}: \qquad \dfrac{2\,n^{\frac{n}{2}}}{2^{\frac{n}{2}}\,\sigma^n\,\Gamma\left(\frac{n}{2}\right)}\; x^{n-1} e^{-\frac{nx^2}{2\sigma^2}}

If the horizontal and vertical deviations u and v of a shot from the centre of the target are independent and normal (0, \sigma), the distance r = \sqrt{u^2 + v^2} from the centre will have the fr. f.

\frac{x}{\sigma^2}\; e^{-\frac{x^2}{2\sigma^2}} \qquad (x > 0).

If the components u, v and w of the velocity of a molecule with respect to a system of rectangular axes are independent and normal (0, \sigma), the velocity \sqrt{u^2 + v^2 + w^2} will have the fr. f.

\sqrt{\frac{2}{\pi}}\; \frac{x^2}{\sigma^3}\; e^{-\frac{x^2}{2\sigma^2}} \qquad (x > 0).

18.2. Student's distribution. — Suppose that the n + 1 random variables \xi and \eta_1, \ldots, \eta_n are independent and normal (0, \sigma). Let us write

\eta = \sqrt{\frac{1}{n}\sum_1^n \eta_i^2},

where the square root is taken positively, and consider the variable

(18.2.1) t = \frac{\xi}{\eta} = \frac{\xi}{\sqrt{\frac{1}{n}\sum_1^n \eta_i^2}}.

Let S_n(x) denote the d. f. of the variable t, so that we have

S_n(x) = P(t \le x) = P\!\left(\frac{\xi}{\eta} \le x\right).

By hypothesis \xi and \eta are independent variables, and thus according to (15.11.3) their joint fr. f. is the product of the fr. f:s of \xi and \eta. Now \xi is normal (0, \sigma), and \eta has the fr. f. given in the last line of the table in the preceding paragraph, so that the joint fr. f. is¹)

¹) As a rule we have hitherto used corresponding letters from different alphabets to denote a random variable and the variable in its d. f. or fr. f., and have thus employed expressions such as: »The random variable \xi has the fr. f. f(x)». When dealing with many variables simultaneously it is, however, sometimes practical to depart from this rule and use the same letter in both places.
We shall thus occasionally use expressions such as: »The random variable x has the fr. f. f(x)», or »The random variables x and y have the joint fr. f. f(x, y)».

\frac{c_n}{\sigma^{n+1}}\; y^{\,n-1}\, e^{-\frac{x^2 + n y^2}{2\sigma^2}},

where y > 0 and

c_n = \frac{2\, n^{\frac{n}{2}}}{\sqrt{2\pi}\; 2^{\frac{n}{2}}\, \Gamma\!\left(\frac{n}{2}\right)}.

The probability of the relation \xi \le x\eta is the integral of the joint fr. f. over the domain defined by the inequalities \eta > 0, \xi \le x\eta. Introducing new variables u, v by the substitution

(18.2.2) \xi = u v, \qquad \eta = v,

the Jacobian of which is \frac{\partial(\xi, \eta)}{\partial(u, v)} = v, we obtain

(18.2.3) S_n(x) = \frac{c_n}{\sigma^{n+1}} \int_{-\infty}^{x} du \int_0^\infty v^n\; e^{-\frac{(n + u^2)\,v^2}{2\sigma^2}}\, dv.

The corresponding fr. f. s_n(x) = S_n'(x) exists for all values of x and is given by the expression

(18.2.4) s_n(x) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}.

The distribution defined by the fr. f. s_n(x) or the d. f. S_n(x) is known under the name of Student's distribution or the t-distribution. It was first used in an important statistical problem by W. S. Gosset, writing under the pen-name of »Student» (Ref. 221). As in the case of the \chi^2-distribution, the parameter n is often denoted as the number of degrees of freedom of the distribution (cf. 29.2).

From the expression of the fr. f. s_n(x), it is seen that the distribution is independent of the s. d. \sigma of the basic variables \xi and \eta_i. This was, of course, to be expected, since the variable t is a homogeneous function of degree zero in the basic variables. — It is further seen that the distribution is unimodal and symmetric about x = 0. The \nu:th moment of the distribution is finite for \nu < n. In particular, the mean is finite for n > 1, and the s. d. for n > 2.
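Since (18.2.4) gives s_n(x) in closed form, the tail probabilities of the distribution are easy to obtain by direct numerical integration. The sketch below is illustrative only: the cut-off 2.5, the integration limit and the grid are arbitrary choices. It exhibits how much heavier the tails are than the normal ones for small n.

```python
import math

# Student's fr.f. (18.2.4) and a crude midpoint-rule tail integral,
# compared with the corresponding normal tail probability.
def s_n(x, n):
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def two_tail(n, t0, upper=80.0, steps=100000):
    # 2 * integral from t0 to (effectively) infinity of s_n(x) dx
    h = (upper - t0) / steps
    return 2 * h * sum(s_n(t0 + (i + 0.5) * h, n) for i in range(steps))

normal_tail = math.erfc(2.5 / math.sqrt(2))     # = 2 * (1 - Phi(2.5))
for n in (3, 10, 100):
    print(n, round(two_tail(n, 2.5), 4), round(normal_tail, 4))
```

For n = 3 the two-sided tail beyond 2.5 is several times the normal value, while for n = 100 the two figures nearly agree, in accordance with the limit results proved in 20.2.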
Owing to the symmetry of the distribution, all existing moments of odd order are zero, while a simple calculation gives

D^2(t) = \int_{-\infty}^{\infty} x^2\, s_n(x)\, dx = \frac{n}{n-2},

and generally for 2\nu < n

\mu_{2\nu} = \frac{1 \cdot 3 \cdots (2\nu - 1)}{(n-2)(n-4)\cdots(n-2\nu)}\; n^\nu.

The probability that the variable t differs from its mean zero in either direction by more than a given quantity t_0 is, as in the case of the normal distribution, equal to the joint area of the two tails of the frequency curve cut off by ordinates through the points \pm t_0. On account of the symmetry of the t-distribution, this is

(18.2.5) P = P(|t| > t_0) = 2 \int_{t_0}^{\infty} s_n(x)\, dx = 2\left(1 - S_n(t_0)\right).

From this relation, the deviation t_0 may be tabulated as a function of the probability P. When P = p/100, the corresponding t_0 = t_p is called the p percent value of t for n degrees of freedom. Some numerical values of this function are given in Table 4, p. 560.

For large values of n, the variable t is asymptotically normal (0, 1), in accordance with the relations

\lim_{n \to \infty} S_n(x) = \Phi(x), \qquad \lim_{n \to \infty} s_n(x) = \Phi'(x) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{x^2}{2}},

which will be proved in 20.2. For small n the t-distribution differs, however, considerably from the limiting normal distribution, as seen from Table 4, where the figures for the limiting case are found under n = \infty.

Fig. 20. Student's distribution, frequency curve for n = 3. Normal frequency curve, m = 0, \sigma = 1.

A diagram of Student's distribution for n = 3, compared with the normal curve, is given in Fig. 20. It is evident from the diagram that the probability of a large deviation from the mean is considerably greater in the t-distribution than in the normal.

If, instead of the variable t as defined by (18.2.1), we consider the variable

(18.2.6) \tau = \frac{\xi_1}{\sqrt{\frac{1}{n}\sum_1^n \xi_i^2}} \qquad (n > 1),

the numerator and the denominator are no longer independent, and the distribution cannot be obtained in the same way as before. It is obvious that we always have \tau^2 \le n, so that the fr. f. of \tau is certainly equal to zero outside the interval (-\sqrt{n}, \sqrt{n}).
Writing

t' = \tau\, \sqrt{\frac{n-1}{\,n - \tau^2\,}},

it is seen that t' is given by an expression of the form (18.2.1), with n replaced by n - 1. Thus t' is distributed in Student's distribution with the d. f. S_{n-1}(x). When \tau increases from -\sqrt{n} to +\sqrt{n}, it is further seen that t' increases steadily from -\infty to +\infty. It follows that the relation \tau \le x is equivalent to the relation

t' \le x\, \sqrt{\frac{n-1}{\,n - x^2\,}},

and we have

P(\tau \le x) = S_{n-1}\!\left(x\, \sqrt{\frac{n-1}{\,n - x^2\,}}\right).

We have thus found the d. f. of the variable \tau. Differentiating with respect to x, we obtain for the fr. f. of \tau the expression

\frac{\Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi n}\;\Gamma\!\left(\frac{n-1}{2}\right)} \left(1 - \frac{x^2}{n}\right)^{\frac{n-3}{2}},

where |x| \le \sqrt{n}. For n = 2, the frequency curve is »U-shaped», i. e. it has a minimum at the mean x = 0. For n = 3, the fr. f. is constant, and we have a rectangular distribution (cf 19.1). For n > 3, the distribution is unimodal and symmetric about x = 0. The mean of the distribution is 0, and the s. d. is 1 for all values of n.

18.3. Fisher's z-distribution. — Suppose that the m + n random variables \xi_1, \ldots, \xi_m, \eta_1, \ldots, \eta_n are independent and normal (0, \sigma). Put

\xi = \sum_1^m \xi_i^2, \qquad \eta = \sum_1^n \eta_j^2,

and consider the variable

(18.3.1) x = \frac{\frac{1}{m}\,\xi}{\frac{1}{n}\,\eta} = \frac{\frac{1}{m}\sum_1^m \xi_i^2}{\frac{1}{n}\sum_1^n \eta_j^2}.

Let F_{mn}(x) denote the d. f. of the variable x. Since \xi and \eta are both non-negative, we have x \ge 0, and F_{mn}(x) is equal to zero for x < 0. For x > 0, we may use the same method as in the preceding paragraph to find F_{mn}(x). Since by hypothesis \xi and \eta are independent, F_{mn}(x) is equal to the integral of the product of the fr. f:s of \xi and \eta over the domain defined by the inequalities \eta > 0, 0 < \xi < \frac{m}{n}\,x\eta. The fr. f:s of \xi and \eta may be taken from the table in 18.1, and so we obtain

F_{mn}(x) = c_{mn} \iint \xi^{\frac{m}{2}-1}\, \eta^{\frac{n}{2}-1}\, e^{-\frac{\xi+\eta}{2\sigma^2}}\, d\xi\, d\eta,

where c_{mn} denotes a constant depending only on m and n. Introducing new variables u, v by the substitution (18.2.2), we find

F_{mn}(x) = c'_{mn} \int_0^{\frac{m}{n}x} u^{\frac{m}{2}-1} (1 + u)^{-\frac{m+n}{2}}\, du.

Hence we obtain by differentiation the fr. f. f_{mn}(x) = F'_{mn}(x) of the variable x:

(18.3.2) f_{mn}(x) = \frac{m^{\frac{m}{2}}\, n^{\frac{n}{2}}\; \Gamma\!\left(\frac{m+n}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)\,\Gamma\!\left(\frac{n}{2}\right)}\; \frac{x^{\frac{m}{2}-1}}{(m x + n)^{\frac{m+n}{2}}} \qquad (x > 0).

Like the t-distribution, this is independent of \sigma. In the particular case m = 1, the variable x has an expression of the same form as the square of the variable t defined by (18.2.1). In the analysis of variance introduced by R. A.
Fisher (cf Ch. 36), we are concerned with a variable z defined by the relation

(18.3.3) x = e^{2z}, \qquad z = \tfrac{1}{2} \log x.

The mean and the variance of the variable e^{2z} = x are easily found from the distribution of x:

(18.3.4) E(e^{2z}) = \frac{n}{n-2} \quad (n > 2), \qquad D^2(e^{2z}) = \frac{2\,n^2\,(m + n - 2)}{m\,(n-2)^2\,(n-4)} \quad (n > 4).

For m > 2, the distribution of x has a unique mode at the point

x_0 = \frac{m-2}{m} \cdot \frac{n}{n+2}.

In order to find the distribution of the variable z itself, we observe that when x increases from 0 to \infty, (18.3.3) shows that z increases steadily from -\infty to +\infty. Thus the relation z \le x is equivalent to e^{2z} \le e^{2x}, and the d. f. of z is

P(z \le x) = F_{mn}(e^{2x}).

Differentiating with respect to x, we obtain for the fr. f. of z the expression given by R. A. Fisher (Ref. 13, 94):

(18.3.5) \frac{2\, m^{\frac{m}{2}}\, n^{\frac{n}{2}}\; \Gamma\!\left(\frac{m+n}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)\,\Gamma\!\left(\frac{n}{2}\right)}\; \frac{e^{mx}}{\left(m\,e^{2x} + n\right)^{\frac{m+n}{2}}}.

18.4. The Beta-distribution. — Using the same notations as in the preceding paragraph, we consider the variable¹)

(18.4.1) \lambda = \frac{\xi}{\xi + \eta} = \frac{\frac{m}{n}\,x}{1 + \frac{m}{n}\,x}.

We obviously have 0 \le \lambda \le 1, so that the fr. f. of \lambda is zero outside the interval (0, 1). As x increases from 0 to \infty, \lambda increases steadily from 0 to 1. The relation \lambda \le x is thus equivalent with the relation that the variable (18.3.1) is at most \frac{n}{m}\cdot\frac{x}{1-x}, and the d. f. of \lambda is

P(\lambda \le x) = F_{mn}\!\left(\frac{n}{m} \cdot \frac{x}{1-x}\right).

Hence we obtain the fr. f. of \lambda:

(18.4.2) \frac{\Gamma\!\left(\frac{m+n}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)\,\Gamma\!\left(\frac{n}{2}\right)}\; x^{\frac{m}{2}-1}\,(1 - x)^{\frac{n}{2}-1}.

¹) In the particular case m = 1, the variable (n + 1)\lambda has an expression of the same form as the square of the variable \tau defined by (18.2.6).

This is the particular case p = \frac{m}{2}, q = \frac{n}{2} of the fr. f. \beta(x; p, q) given by (12.4.5). In the general case, the distribution defined by the fr. f.

(18.4.3) \beta(x;\, p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\; x^{p-1}\,(1 - x)^{q-1} \qquad (0 < x < 1, \; p > 0, \; q > 0),

will be called the Beta-distribution. The \nu:th moment of this distribution is

(18.4.4) \int_0^1 x^\nu\, \beta(x;\, p, q)\, dx = \frac{\Gamma(p+q)\,\Gamma(p+\nu)}{\Gamma(p)\,\Gamma(p+q+\nu)}.

Hence in particular the mean is \frac{p}{p+q}, while the variance is

\frac{p\,q}{(p+q)^2\,(p+q+1)}.

For p > 1, q > 1, there is a unique mode at the point x = \frac{p-1}{p+q-2}.

CHAPTER 19.

Further Continuous Distributions.

19.1. The rectangular distribution. — A random variable \xi will be said to have a rectangular distribution, if its fr. f. is constantly equal to \frac{1}{2h} in a
certain finite interval (a - h, a + h), and zero outside this interval. The frequency curve then consists of a rectangle on the range (a - h, a + h) as base and of height \frac{1}{2h}. We shall also say in this case that \xi is uniformly distributed over (a - h, a + h). The mean of this distribution is a, and the variance is \frac{h^2}{3}.

The error introduced in a numerically calculated quantity by the »rounding off» may often be considered as uniformly distributed over the range (-\tfrac{1}{2}, \tfrac{1}{2}), in units of the last figure.

By a linear transformation of the variable, the range of the distribution may always be transferred to any given interval. Thus e. g. the variable

\eta = \frac{\xi - a + h}{2h}

is uniformly distributed over the interval (0, 1). The corresponding fr. f. is equal to 1 in (0, 1), and to 0 outside (0, 1).

If \eta_1, \eta_2, \ldots are independent variables uniformly distributed over (0, 1), it is evident that the sum \eta_1 + \cdots + \eta_n is confined to the interval (0, n). If f_n(x) denotes the fr. f. of \eta_1 + \cdots + \eta_n, it thus follows that f_n(x) is zero outside (0, n). It further follows from (15.12.4) that we have

f_{n+1}(x) = \int_{-\infty}^{\infty} f_1(x - t)\, f_n(t)\, dt = \int_{x-1}^{x} f_n(t)\, dt.

From this relation, we obtain by easy calculations

f_2(x) = \begin{cases} x & \text{for } 0 < x < 1, \\ x - 2(x-1) & \text{for } 1 < x < 2, \end{cases}

f_3(x) = \begin{cases} \tfrac{1}{2}\, x^2 & \text{for } 0 < x < 1, \\ \tfrac{1}{2}\left(x^2 - 3(x-1)^2\right) & \text{for } 1 < x < 2, \\ \tfrac{1}{2}\left(x^2 - 3(x-1)^2 + 3(x-2)^2\right) & \text{for } 2 < x < 3. \end{cases}

The general expression, which may be verified by induction, is

f_n(x) = \frac{1}{(n-1)!} \left[ x^{n-1} - \binom{n}{1}(x-1)^{n-1} + \binom{n}{2}(x-2)^{n-1} - \cdots \right],

where 0 < x < n, and the summation is continued as long as the arguments x, x-1, x-2, \ldots are positive. f_1 is a discontinuous frequency function, f_2 is continuous but has a discontinuous derivative, f_3 has a continuous derivative but a discontinuous second derivative, and so on. Diagrams of f_1, f_2 and f_3 are shown in Fig. 21.

The mean and the s. d. of the sum \eta_1 + \cdots + \eta_n are \frac{n}{2} and \sqrt{\frac{n}{12}}, so that the fr. f. of the standardized sum is

\sqrt{\frac{n}{12}}\; f_n\!\left(\frac{n}{2} + x\,\sqrt{\frac{n}{12}}\right).

As n increases, this rapidly approaches the normal frequency function \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.
The expression of f_2(x) given above may be written in the form

f_2(x) = 1 - |1 - x| \qquad (0 < x < 2).

This fr. f., and any fr. f. obtained from it by a linear transformation, is sometimes said to define a triangular distribution.

19.2. Cauchy's and Laplace's distributions. — In the particular case n = 1, Student's distribution (18.2.4) has the fr. f.

\frac{1}{\pi\,(1 + x^2)},

the c. f. of which is, by (10.5.7), equal to e^{-|t|}. By a linear transformation, we obtain the fr. f.

(19.2.1) c(x;\, \lambda, \mu) = \frac{\lambda}{\pi\left(\lambda^2 + (x - \mu)^2\right)},

with the c. f.

(19.2.2) e^{i\mu t - \lambda|t|},

where \lambda > 0. The distribution defined by the fr. f. c(x; \lambda, \mu), or by the corresponding d. f. C(x; \lambda, \mu), is called Cauchy's distribution. The distribution is unimodal and symmetric about the point x = \mu, which is the mode and the median of the distribution. No moment of positive order, not even the mean, is finite. The quartiles (cf 15.6) are \mu \pm \lambda, so that the semi-interquartile range is equal to \lambda. If a variable \xi is distributed according to (19.2.1), any linear function a\xi + b has a distribution of the same type, with parameters \lambda' = |a|\lambda and \mu' = a\mu + b.

The form (19.2.2) of the c. f. immediately shows that this distribution reproduces itself by composition, so that we have the addition theorem:

(19.2.3) C(x;\, \lambda_1, \mu_1) * C(x;\, \lambda_2, \mu_2) = C(x;\, \lambda_1 + \lambda_2,\; \mu_1 + \mu_2).

Hence we deduce the following interesting property of the Cauchy distribution: If \xi_1, \ldots, \xi_n are independent, and all have the same Cauchy distribution, the arithmetic mean \bar{\xi} = \frac{1}{n}\sum_1^n \xi_\nu has the same distribution as every \xi_\nu.

The two reciprocal Fourier integrals (10.5.6) and (10.5.7) connect the Cauchy distribution with the Laplace distribution, which has the fr. f. \tfrac{1}{2}\, e^{-|x|}. The latter fr. f. has finite moments of every order, while its derivative is discontinuous at x = 0. By a linear transformation, we obtain the fr. f.

(19.2.4) \frac{1}{2\lambda}\; e^{-\frac{|x - \mu|}{\lambda}},

with the c. f.

\frac{e^{i\mu t}}{1 + \lambda^2 t^2}.

19.3. Truncated distributions. — Suppose that we are concerned with a random variable \xi attached to the random experiment \mathfrak{E}. Let as usual P and F denote the pr. f.
and the d. f. of \xi. From a sequence of repetitions of \mathfrak{E}, we select the sub-sequence where the observed value of \xi belongs to a fixed set S_0. The distribution of \xi in the group of selected cases will then be the conditional distribution of \xi, relative to the hypothesis \xi \subset S_0. According to (14.3.1) or (14.3.2), the conditional probability of the event \xi \subset S, where S is any subset of S_0, may be written \frac{P(S)}{P(S_0)}.

The case when S_0 is an interval a < \xi \le b often presents itself in the applications. This means that we discard all observations where the observed value is \le a or > b. The remaining cases then yield a truncated distribution with the d. f.

F^*(x) = \begin{cases} 0 & \text{for } x \le a, \\ \dfrac{F(x) - F(a)}{F(b) - F(a)} & \text{for } a < x \le b, \\ 1 & \text{for } x > b. \end{cases}

If a fr. f. f(x) = F'(x) exists, the truncated distribution has a fr. f. equal to

\frac{f(x)}{F(b) - F(a)}

for all x in (a, b), and zero outside (a, b). Either a or b may, of course, be infinite.

1. The truncated normal distribution. Suppose that the stature of an individual presenting himself for military inscription may be regarded as a random variable which is normal (m, \sigma). If only those cases are passed where the stature exceeds a fixed limit x_0, the statures of the selected individuals will yield a truncated normal distribution, with the d. f.

F^*(x) = \frac{\Phi\!\left(\frac{x - m}{\sigma}\right) - \Phi\!\left(\frac{x_0 - m}{\sigma}\right)}{1 - \Phi\!\left(\frac{x_0 - m}{\sigma}\right)} \qquad (x > x_0).

Writing

\lambda = \frac{\varphi\!\left(\frac{x_0 - m}{\sigma}\right)}{1 - \Phi\!\left(\frac{x_0 - m}{\sigma}\right)},

the two first moments of the truncated distribution are

a_1 = m + \lambda\sigma, \qquad a_2 = m^2 + \lambda\sigma\,(x_0 + m) + \sigma^2.

If a_1 and a_2 are given, while m and \sigma are unknown, two equations are thus available for the determination of the two unknown quantities. Tables for the numerical solution of these equations have been published by K. Pearson (Ref. 264).

2. Pareto's distribution. In certain kinds of economic statistics, we often meet truncated distributions. Thus e. g. in income statistics the data supplied are usually concerned with the distribution of the incomes of persons whose income exceeds a certain limit x_0 fixed by taxation rules.
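The moment formula a_1 = m + \lambda\sigma can be verified by simulation. In the sketch below the stature parameters are invented for illustration, and \Phi is expressed through the error function; the mean and second moment of the selected cases are compared with the theoretical values.

```python
import math, random

# Illustrative check (invented parameters) of the truncated normal moments:
# with z0 = (x0 - m)/sigma and lambda = phi(z0)/(1 - Phi(z0)), the selected
# cases (stature > x0) have mean a1 = m + lambda*sigma and second moment
# a2 = m^2 + lambda*sigma*(x0 + m) + sigma^2.
random.seed(6)
m, sigma, x0 = 170.0, 8.0, 175.0

phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))

z0 = (x0 - m) / sigma
lam = phi(z0) / (1 - Phi(z0))
a1 = m + lam * sigma
a2 = m * m + lam * sigma * (x0 + m) + sigma * sigma

selected = [x for x in (random.gauss(m, sigma) for _ in range(200000))
            if x > x0]
mean1 = sum(selected) / len(selected)
mean2 = sum(x * x for x in selected) / len(selected)
print(round(mean1 - a1, 2), round(mean2 / a2, 3))
```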
This distribution of incomes, and certain analogous distributions of property values, sometimes agree approximately with the Pareto distribution defined by the relation

P(\xi > x) = \left(\frac{x_0}{x}\right)^{\alpha} \qquad (x \ge x_0, \; \alpha > 0).

The fr. f. of this distribution is \frac{\alpha}{x_0}\left(\frac{x_0}{x}\right)^{\alpha+1} for x > x_0, and zero for x \le x_0. The mean is finite for \alpha > 1, and is then equal to \frac{\alpha}{\alpha-1}\, x_0. The median of the distribution is 2^{\frac{1}{\alpha}}\, x_0. — With respect to the Pareto distribution, we refer to some papers by Hagstroem (Ref. 121, 122).

19.4. The Pearson system. — In the majority of the continuous distributions treated in Chs. 17—19, the frequency function y = f(x) satisfies a differential equation of the form
The multiplicative constant A appearing in all the eijuations below should in every case be so determined that the inte- gral with respect to x over the range indicated becomes equal to unity. Type I. y = A (x — {b — ; a < x < h ; p > 0^ q > 0. For a ~ 0, h -- 1 we obtain the Beta distribution (18.4.3) as a par- ticular case. Taking j) == q - J a——h, and allowing h to tend to infinity, we have the normal distribution as a limiting form. Another limiting form is reached by taking q = ba; when h oo we obtain after changing the notations the following Type III. // = A (x — X > jii; a > 0, A > 0. This is a generalization of the ivA. f(x\ a,X) defined by (12.3.3), and thus a fortiori a generalization of the ;t^-distribution (18.1.3). Type VI. // — A (x — aY~ ^ (x — bY ~' ^ ; x >b \ a<b^ <7 > 0, ^ < i . This contains the distribution (18.3.2) as a particular case (a~— 1. b ^ 0 ). Type VII. y - - «> < x < oo; m > i This contains Student’s distribution (18.2.4) as a particular case. 249 20.1-2 CHAPTER 20. Some Convergence Theorems. 20 . 1 . Convergence of distributions and variables. — If we are given a sequence of random variables ^ 9 > • • • ^ith the d. f;s Fi{x)^ F^[x), . . it is often important to know whether the sequence of d. f:s converges, in the sense of 6.7, to a limiting d. f. F[x), Thus e. g. the central limit theorem asserts that certain sequences of d. f;s converge to the normal d. f. 0{x). — In the next paragraph, we shall give some further important examples of cases of convergence to the normal distribution. It is important to observe that any statement concerning the con- vergence of the sequence of d.f:s {Fn(a:)} should be well distinguished from a statement concerning the convergence of the sequence oj variable's {^w}. We shall not have occasion to enter in this book upon a full discussion of the convergence properties of sequences of random variables. In this respect, the reader may be referred to the books by Frechet (Ref. 15) and Levy (Ref. 25). 
We shall here only use the conception of convergence in probability, which will be treated in the paragraphs 3—6 of the present chapter.

20.2. Convergence of certain distributions to the normal. — 1. The Poisson distribution. — By 16.5, a variable \xi distributed in Poisson's distribution has the mean \lambda, the s. d. \sqrt{\lambda}, and the c. f. e^{\lambda(e^{it} - 1)}. The standardized variable \frac{\xi - \lambda}{\sqrt{\lambda}} thus has the c. f.

e^{-it\sqrt{\lambda}}\; e^{\lambda\left(e^{it/\sqrt{\lambda}} - 1\right)} = e^{-\frac{t^2}{2} + O\left(\lambda^{-\frac{1}{2}}\right)}.

As \lambda tends to infinity, this tends to e^{-\frac{t^2}{2}}, and by the continuity theorem 10.4 the corresponding d. f. then tends to \Phi(x). Thus \xi is asymptotically normal (\lambda, \sqrt{\lambda}).

2. The \chi^2-distribution. — For n degrees of freedom, the variable \chi^2 has by (18.1.6) and (18.1.2) the mean n, the s. d. \sqrt{2n}, and the c. f. (1 - 2it)^{-\frac{n}{2}}. Thus the standardized variable \frac{\chi^2 - n}{\sqrt{2n}} has the c. f.

e^{-it\sqrt{\frac{n}{2}}} \left(1 - \frac{2it}{\sqrt{2n}}\right)^{-\frac{n}{2}},
2 ) ..w - Further, let r denote the greatest integer contained in * Them r and thus we have for all n^\ and for all real x n+l /i . /. x^Y - 1 ' 1 Thus the sequence {t'?n(a:)} is uniformly dominated by a function of the form A(1 + so that (5.5.2) gives (20.2.;{) S„ (x) = j s„ (/) d t - j e~'^ (I ( <P (r). 4 . The Beta distrihiition. — Let ^ be a variable distributed in the Beta distribution (18.4.3), with the values np and nq of the para- meters. The mean and the variance of $ are then, by 18.4, — " — and p + q pq — ^ ^ ;-x* Let now n tend to infinity, while p and q re- (p ^ <l) h^P + -f 1) main fixed. By calculations similar to those made above, it can then be proved that the fr. f. of the standardized variable tends to the 1 normal fr. f. » and that the corresponding d. f. tends to the normal d. f. 0(x). 20.3. Convergence in probability. — Let ... be a sequence of random variables, and let P\(x) and q>n(t) denote the d. f. and the c. f. of We shall say (cf Cantelli, Ref. 64, Slutsky, Ref. 214, and Frechet, Ref. 112) that converges in prohahility to a consta}it c if, for any e > 0, the probability of the relation [fn ■— c| > f tends to zero as « oo. Thus if denotes the frequency vin of aif event PI in a series of n repetitions of a random experiment (i, Bernoulli’s theorem 16.3 asserts that v/}i converges in probability to p. A necessary and sufficient condition for the convergence in prob- ability of to c is obviously that the d. f. T\i{x) tends, for every fixed X ^ c, to the particular d. f. £(./• — c) defined in 16.1. 252 20.3-5 By the continuity theorem 10.4, an equivalent condition is that the c. f. q)n[t) tends for every fixed t to the limit 20.4. Tchebycheff’s theorem. — We shall prove the following theo- rem, which is substantially due to TchebycheflF. Let . . . he random variables^ and let nin and On denote the mean and the s. rf. of If On 0 as n oo ^ then — win converges in prohahility to zero. 
In order to prove this theorem, it is sufficient to apply the Bienaym4-TchebycheflF inequality (15.7.2) to the variable — win- We then see that the probability of the relation | s'n nifi I ^ t 18 ^ 2 ’ and by hypothesis this tends to zero as n oo. Let us now suppose that the variables fg, . . . are independent, and write 1 1 We then have the following corollary of the theorem: If (20.4.1) 2*^' = 1 then ^ — m converges in prohahility to zero. The variable | has, in fact, the mean m and the s. d. 1 n Or. By hypothesis, the latter tends to zero as w -> oo , and thus the truth of the assertion follows from the above theorem. In the particular case when the are the variables considered in 16.6, in connection with a series of independent trials, On is bounded and thus (20.4.1) is satisfied. The corollary then reduces to the Poisson generalization of Bernoulli’s theorem. 20.5. Khlntchine’s theorem. — Even if the existence of finite standard deviations is not assumed for the variables considered in the preceding paragraph, it may still be possible to obtain a result corresponding to the corollary of Tchebychefif’s theorem. We shall only consider the case when all the have the safme probability distribution, and prove the following theorem due to Khintchine (Ref. 139). 253 20.5-6 ?i!) • • • independent random variables all having the same d.f. F{x), and suppose that F(x) has a finite mean m. Then the variable - 1 J converges in probability to m. 1 If q>[t) is the c. f. of the common distribution of the the c. f. of the variable f is (w)) According to (10.1.3), we have for ^ 0 q){t)—\ -f m/f + o(f), and thus for any fixed as w oo, According to 20.3, this proves the theorem. 20.6. A convergence theorem. — The following theorem will be useful in various applications: Fet ^ 1 , • be a sequence of random variables, ivith the d.f:s F^, F^, .... Suppose that Fn{x) tends to a d.f. F(x) as n qo. 
Let fjii • • • be another sequence of random, variables, and supj)ose that fjn cont'erges in probability to a constant c. Put (20.0.1) Xn = fn -f Yn = fn • Tjn Then the d.f of Xn tends to P^(x — c). Further, if e > 0, the d.f. of Yn tends to F^^, while the d.f. of Zn tends to F[cx). (The modifica- tion required tvhen c < 0 is evident.) It is important to observe that, in this theorem, there is no con- ditioti of independence for any of the variables involved. It is sufficient to prove one of the assertions of the theorem, as the other proofs are quite similar. Take, e. g., the case of Zn- Let x be a continuity point of F[cx\ and' denote by Pn the joint prob- ability function of and r;n. We then have to prove that P»(^= Sa:) ^P(cx) as »->■<». Now the set S of all points in the (?„. jj»)-plane such that 254 20.6 ^ X is the sum of two sets and without common points^ rjn defined by the inequalities -sv r}n 1 — c 1 ^ S,: Sn ^ ~ ^ X, 1 »?» — C 1 > «. rjn Thus we have Pn{S) = Pn(Si) -f Pn(S^)> Here is a subset of the set \r]n — ^ I > «i and thus by hypothesis Pn (S^) 0 for any £ > 0. Further, Pn(S^) is enclosed between the limits Pn (sn ±e)X, — C\S s). Each of these limits differs from the corresponding quantity Pn (bn ^ (C ± 6) X*) — Fn ((c ± c) x) by less than Pn(\r]n — c \ > e). As ?/ -> oo , the latter quantity tends to zero, and we thus see that Pn (S) is enclosed between two limits, which can be made to lie as close to F(cx) as we please, by choosing € sufficiently small. Thus our theorem is proved. Hence we deduce the following proposition due to Slutsky ( Ref. 214): If 7]n^ . . ., Qn random variables converging in probabi- lity to the constants x, y, . . r respectively, any rational function ^(?n, r}ny . . ^n) couvcrges in probability to the constant It[x, y, . , r), provided that the latter is finite. It folloivs that any power r}n, ■ . ^«) with k> 0 converges in probability to ^*)- Exercises to Chapters 15-20. 1. The \ariab1e ^ has the fr. i. f{x). Find the fr. 
f:s of the variables η = ξ² and ζ = cos ξ. Give conditions of existence for the moments of η and ζ.

2. For any k > 1, the function f(x) = ((k − 1)/2)(1 + |x|)⁻ᵏ is a fr. f. with the range (−∞, ∞). Show that the n:th moment exists when and only when n < k − 1.

3. The inequality (15.4.6) for the absolute moments βₙ is a particular case of the following inequality due to Liapounoff (Ref. 147). For any non-negative n, p, q (not necessarily integers), we have

(p + q) log β_{n+p} ≦ q log βₙ + p log β_{n+p+q}.

For n = 0, q = 1 this reduces to (15.4.6), since β₀ = 1. The general inequality expresses that a chord joining two points of the curve y = log βₓ (x > 0) lies entirely above the curve, so that log βₓ is a convex function of x. (For a detailed proof, see e. g. Uspensky, Ref. 89, p. 266.)

4. When g(x) is never increasing for x > 0, we have for any k > 0 and λ > 0

kᵠ ∫_k^∞ g(x) dx ≦ ∫_0^∞ xᵠ g(x) dx,

where the exponent is λ throughout. First prove that the inequality is true in the particular case when g(x) is constant for 0 < x < c, and equal to zero for x > c. Then define a function h(x) which is constantly equal to g(k) for 0 < x < k + a, and equal to zero for x > k + a, where a is determined by the condition a g(k) = ∫_k^∞ g(x) dx, and show that

k^λ ∫_k^∞ g(x) dx = k^λ ∫_k^∞ h(x) dx ≦ ∫_0^∞ x^λ h(x) dx ≦ ∫_0^∞ x^λ g(x) dx.

Use this result to prove the inequalities (15.7.3) and (15.7.4).

5. If F(x) is a d. f. with the mean 0 and the s. d. σ, we have

F(x) ≦ σ²/(σ² + x²) for x < 0,  F(x) ≧ x²/(σ² + x²) for x > 0.

For x < 0, this follows from the inequalities

−x = ∫_{−∞}^∞ (y − x) dF ≦ ∫_x^∞ (y − x) dF,
x² ≦ (1 − F(x)) ∫_x^∞ (y − x)² dF ≦ (1 − F(x))(σ² + x²),

where the second line uses the Schwarz inequality. For x > 0, the proof is similar. Show by an example that these inequalities cannot be improved.

6. The Bienaymé–Tchebycheff inequality (15.7.2) may be improved, if some central moment μₙ with n > 1 is known. We have, e.
g., for k > 1

P(|ξ − m| ≧ kσ) ≦ (μ₄ − σ⁴)/(μ₄ + k⁴σ⁴ − 2k²σ⁴) = (γ₂ + 2)/((k² − 1)² + γ₂ + 2).

Apply (15.7.1) with K = 1 and

g(ξ) = [1 + σ²(k² − 1)((ξ − m)² − k²σ²)/(μ₄ + k⁴σ⁴ − 2k²σ⁴)]².

7. Use (15.4.6) to show that the semi-invariant ϰₙ of an arbitrary distribution satisfies the inequality |ϰₙ| ≦ nⁿ βₙ. (Cramér, Ref. 11, p. 27.)

8. Prove the inequality |a + b|ⁿ ≦ 2ⁿ⁻¹(|a|ⁿ + |b|ⁿ). Hence deduce that, if the n:th moments of ξ and η exist, so does the n:th moment of ξ + η.

9. Show that the first absolute moment about the mean of the binomial distribution is

E(|ν − np|) = 2μ (n choose μ) p^μ q^(n−μ+1),

where μ is the smallest integer > np. For large n, it follows that E(|ν − np|) ∼ √(2npq/π).

10. Show that if 1 − F(x) = O(e^(−cx)) as x → +∞, and F(x) = O(e^(cx)) as x → −∞ (c > 0), the distribution is uniquely determined by its moments.

11. The factorial moments (Steffensen, Ref. 217) of a discrete distribution are

α_[r] = Σᵥ pᵥ xᵥ^[r],

where x^[r] denotes the factorial x(x − 1)⋯(x − r + 1). Similarly, the central factorial moments are μ_[r] = Σᵥ pᵥ (xᵥ − m)^[r]. Express α_[r] and μ_[r] by means of the ordinary moments, and hence deduce relations between the factorial and the ordinary moments.

12. The c. f. of the distribution in the preceding exercise is φ(t) = Σᵥ pᵥ e^(itxᵥ). Substituting here t for e^(it), we obtain the generating function ψ(t) = Σᵥ pᵥ t^(xᵥ). Show that ψ^(r)(1) = α_[r], and in particular E(ξ) = ψ′(1), D²(ξ) = ψ″(1) + ψ′(1) − (ψ′(1))². Use this result to deduce the expressions D²(ν) = npq for the binomial distribution, and D²(ν) = λ for the Poisson distribution.

13. a) We make a series of independent trials, the probability of a »success» being in each trial equal to p = 1 − q, and we go on until we have had an uninterrupted set of ν successes, where ν > 0 is given. Let pₙ^(ν) denote the probability that exactly n trials will be required for this purpose. Find the generating function

ψ(t) = Σ_{n=1}^∞ pₙ^(ν) tⁿ = p^ν t^ν (1 − pt)/(1 − t + q p^ν t^(ν+1)),
and show that E(n) = ψ′(1) = (1 − p^ν)/(q p^ν).

b) On the other hand, let us make n trials, where n is given, and observe the length μ of the longest uninterrupted set of successes occurring in the course of these n trials. Denoting by Pₙ^(ν) the probability that μ < ν, show that Pₙ^(ν) = 1 − Σ_{j=1}^n p_j^(ν), and thus

Σ_{n=0}^∞ Pₙ^(ν) tⁿ = (1 − ψ(t))/(1 − t) = (1 − p^ν t^ν)/(1 − t + q p^ν t^(ν+1)).

Hence it can be shown (Cramér, Ref. 68) that the difference between Pₙ^(ν) and e^(−nqp^ν) tends to zero as n → ∞, uniformly for 1 ≦ ν ≦ n. It follows that for large n we have

E(μ) = log n / log (1/p) + O(1),  D²(μ) = O(1).

14. The variable ξ is normal (m, σ). Show that the mean deviation is E(|ξ − m|) = σ√(2/π) = 0.79788 σ.

15. In both cases of the Central Limit Theorem proved in 17.4, we have

E(|ξ₁ + ⋯ + ξₙ − nm|)/√n → σ√(2/π)

as n → ∞. Use (7.6.9) and (9.6.1). (Cf Ex. 9.)

16. Let ξ₁, ξ₂, … be independent variables, such that ξᵥ has the possible values 0 and ±νᵃ, the respective probabilities being 1 − ν^(−2a) and ½ν^(−2a). Thus ξᵥ has the mean 0 and the s. d. 1. Show that the Liapounoff condition (17.4.8) is satisfied for a < ½, but not for a ≧ ½. Thus for a < ½ the sum ξ₁ + ⋯ + ξₙ is asymptotically normal (0, √n). For a > ½, the probability that ξ₁ = ξ₂ = ⋯ = ξₙ = 0 does not tend to zero as n → ∞, so that in this case the distribution of ξ₁ + ⋯ + ξₙ does not tend to normality. The last result holds also for a = ½ (cf Cramér, Ref. 11, p. 62).

17. If α₁ and α₂ are the two first moments of the logarithmico-normal distribution (17.5.3), and if η is the real root of the equation η³ + 3η − γ₁ = 0, where γ₁ is the coefficient of skewness, the parameters a, m and σ of the distribution are given by

a = α₁ − √(α₂ − α₁²)/η,  σ² = log (1 + η²),  m = log (α₁ − a) − ½σ².

18. Consider the expansion (17.6.8) of a fr. f. f(x) in Gram–Charlier series, and take f(x) = (1/(σ√(2π))) e^(−x²/(2σ²)). For x = 0, we have f(0) = 1/(σ√(2π)), and the expansion becomes

1/(σ√(2π)) = (1/√(2π)) Σᵥ ((2ν)!/(2^(2ν)(ν!)²)) (1 − σ²)ᵥ.

This is, however, only correct if σ² < 2. For σ² > 2, the series is divergent. Find α and β such that α f(x) + β f(μx) is the fr. f.
of a standardized variable, and show by means of this example that the coefficient ¼ in the convergence condition (17.6.6 a) cannot be replaced by any smaller number.

19. Calculate the coefficients γ₁ and γ₂ for the various distributions treated in Ch. 18.

20. If the variable η is uniformly distributed over (a − h, a + h), the c. f. of η is e^(iat) (sin ht)/(ht). If ξ is an arbitrary variable independent of η, with the c. f. φ(t), the sum ξ + η has the c. f. e^(iat) ((sin ht)/(ht)) φ(t). Show that, by the aid of this result, the formula (10.3.3) may be directly deduced from (10.3.1).

21. Let ν be a random variable having a Poisson distribution with the probabilities P(ν = r) = e^(−x) xʳ/r!, where r = 0, 1, …. If we consider here the parameter x as a random variable with the fr. f. (1/ω) e^(−x/ω), the probability that ν takes any given value r is

P(ν = r) = ∫_0^∞ (e^(−x) xʳ/r!)(1/ω) e^(−x/ω) dx = (1/(1 + ω)) (ω/(1 + ω))ʳ.

Find the c. f., the mean and the s. d. of this distribution, which is known as the negative binomial distribution.

22. x₁, x₂, … are independent variables having the same distribution with the mean 0 and the s. d. 1. Use the theorems of 20.6 to show that the variables

y = (x₁ + ⋯ + xₙ)/√(x₁² + ⋯ + xₙ²) and z = √n (x₁ + ⋯ + xₙ)/(x₁² + ⋯ + xₙ²)

are both asymptotically normal (0, 1).

23. If xₙ and yₙ are asymptotically normal (a, h/√n) and (b, k/√n) respectively, where b ≠ 0, then the variable zₙ = √n (xₙ − a)/yₙ is asymptotically normal (0, h/b). — Note that there is no condition of independence in this case.

Chapters 21–24. Variables and Distributions in Rₙ.

CHAPTER 21.

The Two-Dimensional Case.

21.1. Two simple types of distributions. — Consider two one-dimensional random variables ξ and η. The joint probability distribution (cf 14.2) of ξ and η is a distribution in R₂, or a two-dimensional distribution. This case will be treated in the present chapter, before we proceed to the general case of variables and distributions in n dimensions.
According to 8.4, we are at liberty to define the joint distribution of ξ and η by the probability function P(S), which represents the probability of the relation (ξ, η) ⊂ S, or by the distribution function given by the relation F(x, y) = P(ξ ≦ x, η ≦ y). We shall often interpret the probability distribution by means of a distribution of a unit of mass over the (ξ, η)-plane.

By projecting the mass in the two-dimensional distribution on one of the coordinate axes, we obtain (cf 8.4) the marginal distribution of the corresponding variable. Denoting by F₁(x) the d. f. of the marginal distribution of ξ, and by F₂(y) the corresponding function for η, we have

F₁(x) = P(ξ ≦ x) = F(x, ∞),  F₂(y) = P(η ≦ y) = F(∞, y).

As in the one-dimensional case (cf 15.2), it will be convenient to introduce here two simple types of distributions: the discrete and the continuous type.

1. The discrete type. A two-dimensional distribution will be said to belong to the discrete type, if the corresponding marginal distributions both belong to the discrete type as defined in 15.2. In each marginal distribution, the total mass is then concentrated in certain discrete mass points, of which at most a finite number are contained in any finite interval. Denote by x₁, x₂, … and by y₁, y₂, … the discrete mass points in the marginal distributions of ξ and η respectively. The total mass in the two-dimensional distribution will then be concentrated in the points of intersection of the straight lines ξ = xᵢ and η = yₖ, i. e. in the points (xᵢ, yₖ), where i and k independently assume the values 1, 2, 3, …. If the mass situated in the point (xᵢ, yₖ) is denoted by pᵢₖ, we have

(21.1.1) P(ξ = xᵢ, η = yₖ) = pᵢₖ,

while for every set S not containing any point (xᵢ, yₖ) we have P(S) = 0. Since the total mass in the distribution must be unity, we always have Σ pᵢₖ = 1. For certain combinations of indices i, k we may, of course, have pᵢₖ = 0.
The points (xᵢ, yₖ) for which pᵢₖ > 0 are the discrete mass points of the distribution.

Consider now the marginal distribution of ξ, the discrete mass points of which are x₁, x₂, …. If pᵢ. denotes the mass situated in the point xᵢ, we obviously have

(21.1.2) pᵢ. = Σₖ pᵢₖ.

Similarly, in the marginal distribution of η, the point yₖ carries the mass p.ₖ given by

(21.1.3) p.ₖ = Σᵢ pᵢₖ.

By (15.11.2), a necessary and sufficient condition for the independence of the variables ξ and η is that we have for all i and k

(21.1.4) pᵢₖ = pᵢ. p.ₖ.

2. The continuous type. A two-dimensional distribution will be said to belong to the continuous type, if the d. f. F(x, y) is everywhere continuous, and if the fr. f. (cf 8.4)

f(x, y) = ∂²F/∂x∂y

exists and is continuous everywhere, except possibly in certain points belonging to a finite number of curves. For any set S we then have

P(S) = ∫∫_S f(x, y) dx dy,

and thus in particular for S = R₂

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = 1.

The marginal distribution of the variable ξ has the d. f.

P(ξ ≦ x) = ∫_{−∞}^x ∫_{−∞}^∞ f(t, u) du dt = ∫_{−∞}^x f₁(t) dt,

where

(21.1.5) f₁(x) = ∫_{−∞}^∞ f(x, y) dy.

If, at a certain point x = x₀, the function f(x, y) is continuous with respect to x for almost all (cf 5.3) values of y and if, in some neighbourhood of x₀, we have f(x, y) < G(y), where G(y) is integrable over (−∞, ∞), then it follows from (7.3.1) that f₁(x) is continuous at x = x₀. In all cases that will occur in the applications, these conditions are satisfied for all x₀, except at most for a finite number of points. In such a case f₁(x) has at most a finite number of discontinuities, so that the marginal distribution of ξ is of the continuous type and has the fr. f. f₁(x). Similarly, we find that the marginal distribution of η has the fr. f.

(21.1.6) f₂(y) = ∫_{−∞}^∞ f(x, y) dx.

By (15.11.3), a necessary and sufficient condition for the independence of the variables ξ and η is that we have for all x and y

(21.1.7) f(x, y) = f₁(x) f₂(y).
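As a concrete illustration of (21.1.1)–(21.1.4), the following sketch builds a small discrete two-dimensional distribution (the 2 × 3 mass table is hypothetical, chosen so that the factorization holds), projects it onto each axis to obtain the marginal masses, and checks the independence criterion:

```python
from fractions import Fraction as F

# Hypothetical 2x3 mass table p_ik; the masses are chosen so that the
# factorization p_ik = p_i. * p.k holds, i.e. xi and eta are independent.
p = {(0, 0): F(1, 8), (0, 1): F(1, 4), (0, 2): F(1, 8),
     (1, 0): F(1, 8), (1, 1): F(1, 4), (1, 2): F(1, 8)}
assert sum(p.values()) == 1          # the total mass must be unity

# Marginal masses (21.1.2) and (21.1.3): project the mass onto each axis.
p_i = {i: sum(m for (a, _), m in p.items() if a == i) for i in (0, 1)}
p_k = {k: sum(m for (_, b), m in p.items() if b == k) for k in (0, 1, 2)}

# Independence criterion (21.1.4): p_ik = p_i. * p.k for every i, k.
independent = all(p[(i, k)] == p_i[i] * p_k[k] for (i, k) in p)
print(independent)  # -> True
```

Exact rational arithmetic is used so that (21.1.4) can be tested by strict equality rather than within a floating-point tolerance.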
21.2. Mean values, moments. — The mean value of a function g(ξ, η), integrable over R₂ with respect to the two-dimensional pr. f. P(S), has been defined in (15.3.2) by the integral

(21.2.1) E(g(ξ, η)) = ∫_{R₂} g(x, y) dP.

For a distribution belonging to one of the two simple types, this reduces to a sum or an ordinary Lebesgue integral, as indicated in 15.3 for the one-dimensional case. The fundamental rules of calculation for mean values have already been deduced in 15.3 for any number of dimensions.

The moments of the distribution (cf 9.2) are the mean values

(21.2.2) α_{ik} = E(ξⁱ ηᵏ) = ∫ xⁱ yᵏ dP,

where i and k are non-negative integers. The sum i + k of the indices is the order of the moment α_{ik}. The moments α_{i0} = E(ξⁱ) and α_{0k} = E(ηᵏ) are identical with the moments of the one-dimensional marginal distributions of ξ and η respectively, as shown by the integral relation (9.2.2). In particular, we put

α₁₀ = E(ξ) = m₁,  α₀₁ = E(η) = m₂.

The point with the coordinates ξ = m₁, η = m₂ is the centre of gravity of the mass of the two-dimensional distribution. For the moments about the centre of gravity we shall use a particular notation, writing in generalization of (15.4.3)

(21.2.3) μ_{ik} = E((ξ − m₁)ⁱ (η − m₂)ᵏ).

Thus in particular we have μ₁₀ = μ₀₁ = 0 and μ₂₀ = σ₁², μ₀₂ = σ₂², where σ₁ and σ₂ are the standard deviations of ξ and η. Between the moments α_{ik} and the central moments μ_{ik} we have relations analogous to those given in 15.4 for the one-dimensional case. Thus for the second order moments we have

(21.2.4) μ₂₀ = α₂₀ − m₁², μ₁₁ = α₁₁ − m₁m₂, μ₀₂ = α₀₂ − m₂².

μ₁₁ is often called the second order product moment or mixed moment. Further, while μ₂₀ and μ₀₂ are the variances of ξ and η, the product moment μ₁₁ is also called the covariance of ξ and η. In the particular case when the variables ξ and η are independent, we have by the multiplication theorem (15.3.4)

μ₁₁ = E((ξ − m₁)(η − m₂)) = E(ξ − m₁) E(η − m₂) = 0.

Thus in particular we have in this case μ₁₁ = 0.

For any real t and u we have

(21.2.5) E((t(ξ − m₁) + u(η − m₂))²) = μ₂₀t² + 2μ₁₁tu + μ₀₂u².
/ For any real t and u have (21.2.5) E[(f(^ — m,) + «(ij — »w,))*] = ^ 263 21.2 The first member of this identity is the mean value of a square, and is thus non-neg^ative. It follows that the second member is a non- negative quadratic form (cf 11.10) in t and w, so that the moment matrix M = \ I is non-negative, and we have I tUl /^03 ) (21.2.6) The rank r of M may (cf 11.6) have one of the values 0, 1 and 2. When r — 2, we have the sign > in (21.2.6), while the sign = holds for r — 1 and r 0. We shall now show that certain simple proper- ties of the distribution are directly connected with the value of r. We have r = 0 when and only when the total mass of the distribution is situated in a single point. We have r = 1 when and only when the total mass of the distribw tion is situated on a certain straight line, but not in a single point. We have r — 2 when and only when there is no straight line that contains the total mass of the distribution. It is obviously sufficient to prove the cases r = 0 and r = 1 , as the case r — 2 then follows as a corollary. — When r = 0, we have ^20 “ 1^02 ~ so marginal distribution of each variable has its total mass concentrated in one single point (cf 16.1). In the two- dimensional distribution, the whole mass must then be concentrated in the centre of gravity (n?,, m^. Conversely, if we know that the whole mass of the distribution belongs to one single point, it follows immediately that = 0, and hence by (21.2.6) = 0, so that M is of rank zero. Further, when ?*=1, the form (21.2.5) is semi-definite (cf 11.10), an thus takes the value zero for some f = u = Uq not both equal to zero. 
This is only possible if the whole mass of the distribution is situated on the straight line

(21.2.7) t₀(ξ − m₁) + u₀(η − m₂) = 0.

Conversely, if it is known that the total mass of the distribution is situated on a straight line, but not in a single point, it is evident that the line must pass through the centre of gravity, and thus have an equation of the form (21.2.7). The mean value in the first member of (21.2.5) then reduces to zero for t = t₀, u = u₀, so that the quadratic form in the second member is semi-definite, and it follows that M is of rank one. Thus our theorem is proved.

Let us now suppose that we have a distribution such that both variances μ₂₀ = σ₁² and μ₀₂ = σ₂² are positive. (This means i. a. that M is of rank 1 or 2.) We may then define a quantity ρ by writing

(21.2.8) ρ = μ₁₁/(σ₁σ₂) = μ₁₁/√(μ₂₀ μ₀₂).

By (21.2.6) we then have ρ² ≦ 1, or −1 ≦ ρ ≦ 1. Further, the case ρ² = 1 occurs when and only when M is of rank 1, i. e. when the whole mass of the distribution is situated on a straight line. — In the particular case when the variables ξ and η are independent, we have μ₁₁ = 0 and thus ρ = 0. The quantity ρ is the correlation coefficient of the variables ξ and η; this will be further dealt with in 21.7.

Suppose that we are given any quantities m₁, m₂ and any μ₂₀, μ₁₁, μ₀₂ subject to the restriction that the quadratic form μ₂₀t² + 2μ₁₁tu + μ₀₂u² is non-negative. We can then always find a distribution having m₁, m₂ for its first order moments and μ₂₀, μ₁₁, μ₀₂ for its second order central moments. The required conditions are, e. g., satisfied by the discrete distribution obtained by placing the mass (1 + ρ)/4 in each of the two points (m₁ + σ₁, m₂ + σ₂) and (m₁ − σ₁, m₂ − σ₂), and the mass (1 − ρ)/4 in each of the two points (m₁ + σ₁, m₂ − σ₂) and (m₁ − σ₁, m₂ + σ₂). The quantities σ₁, σ₂ and ρ are here, of course, defined according to the above expressions.

21.3. Characteristic functions. — The mean value

(21.3.1) φ(t, u) = E(e^(i(tξ + uη))) = ∫ e^(i(tx + uy)) dP

is the characteristic function (c. f.)
of the two-dimensional random variable (ξ, η), or of the corresponding distribution. We shall also often call φ(t, u) the joint c. f. of the two one-dimensional variables ξ and η. According to the theory of c. f:s given in Ch. 10, the one-to-one correspondence between one-dimensional distributions and their c. f:s (cf 15.9) extends itself to distributions in any number of dimensions. If two distributions are identical, so are their c. f:s, and conversely.

If the second order moments of the joint distribution of ξ and η are finite, we have in the neighbourhood of the point t = u = 0 the development analogous to (10.1.3)

(21.3.2) φ(t, u) = 1 + i(α₁₀t + α₀₁u) − ½(α₂₀t² + 2α₁₁tu + α₀₂u²) + o(t² + u²)
= e^(i(m₁t + m₂u)) [1 − ½(μ₂₀t² + 2μ₁₁tu + μ₀₂u²) + o(t² + u²)].

In the particularly important case when the mean values m₁ and m₂ are both equal to zero, we thus have

(21.3.3) φ(t, u) = 1 − ½(μ₂₀t² + 2μ₁₁tu + μ₀₂u²) + o(t² + u²).

The c. f:s of the marginal distributions of ξ and η are

(21.3.4) E(e^(itξ)) = φ(t, 0) and E(e^(iuη)) = φ(0, u).

If the variables ξ and η are independent, we have

φ(t, u) = E(e^(itξ) · e^(iuη)) = E(e^(itξ)) · E(e^(iuη)),

so that the joint c. f. φ(t, u) is the product of the c. f:s of the marginal distributions corresponding to ξ and η respectively.

Conversely, suppose that it is known that the joint c. f. of ξ and η is of the form φ₁(t) · φ₂(u). Introducing, if necessary, a multiplicative constant into the factors, we may obviously assume φ₁(0) = φ₂(0) = 1, and then it follows from (21.3.4) that φ₁(t) and φ₂(u) are the c. f:s of ξ and η respectively.
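The product property just derived can be checked numerically. The sketch below (a hypothetical four-point distribution with independent 0/1 components) evaluates the joint c. f. by direct summation and compares it with the product of the marginal c. f:s obtained from (21.3.4):

```python
import cmath

# A hypothetical four-point distribution with independent components:
# xi and eta are independent fair 0/1 variables, so p_ik = p_i. * p.k = 1/4.
points = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def phi(t, u):
    """Joint c. f. phi(t, u) = E exp(i(t*xi + u*eta)) by direct summation."""
    return sum(m * cmath.exp(1j * (t * x + u * y)) for (x, y), m in points.items())

# Marginal c. f.s via (21.3.4), and the product relation for independent variables.
t, u = 0.7, -1.3
factorizes = abs(phi(t, u) - phi(t, 0) * phi(0, u)) < 1e-12
print(factorizes)  # -> True
```

The test point (t, u) = (0.7, −1.3) is arbitrary; the factorization holds identically in t and u for this distribution.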
If the two-dimensional interval defined by a₁ < ξ < b₁, a₂ < η < b₂ is a continuity interval (cf 8.3) of the joint distribution of ξ and η, it further follows from the inversion formulae (10.3.1) and (10.6.2) that we have the multiplicative relation

P(a₁ < ξ < b₁, a₂ < η < b₂) = P(a₁ < ξ < b₁) · P(a₂ < η < b₂).

Allowing here a₁ and a₂ to tend to −∞, we obtain in particular, using the same notations as in 21.1,

F(x, y) = F₁(x) F₂(y)

for all x and y that are continuity points of F₁ and F₂ respectively. By the general continuity properties of d. f:s, this relation is immediately extended to all x and y. From (14.4.5) it then follows that the variables ξ and η are independent, and we have thus proved the following theorem:

A necessary and sufficient condition for the independence of two one-dimensional random variables is that their joint c. f. is of the form

(21.3.5) φ(t, u) = φ₁(t) φ₂(u).

21.4. Conditional distributions. — The conditional distribution of a random variable η, relative to the hypothesis that another variable ξ belongs to some given set S, has been defined in 14.3. In the present paragraph, we shall consider this question somewhat more closely for distributions of the two simple types introduced in 21.1.

1. The discrete type. Consider the discrete distribution defined by (21.1.1), and let xᵢ be a value such that the marginal probability P(ξ = xᵢ) = pᵢ. is positive. The conditional probability of the event η = yₖ, relative to the hypothesis ξ = xᵢ, is then by (14.3.1)

(21.4.1) P(η = yₖ | ξ = xᵢ) = P(ξ = xᵢ, η = yₖ)/P(ξ = xᵢ) = pᵢₖ/pᵢ..

For a fixed xᵢ, the conditional probabilities of the various possible values yₖ define the conditional distribution of η, relative to the hypothesis ξ = xᵢ. The sum of all these conditional probabilities is, of course, equal to 1.
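In the discrete case, (21.4.1) amounts to renormalizing one column of the mass table. The sketch below uses a hypothetical dependent 2 × 2 table with the mass concentrated on the diagonal, and also evaluates the conditional mean of η:

```python
from fractions import Fraction as F

# Hypothetical dependent 2x2 mass table p_ik, mass concentrated on the diagonal.
p = {(0, 0): F(3, 8), (0, 1): F(1, 8),
     (1, 0): F(1, 8), (1, 1): F(3, 8)}
y_vals = (0, 1)

def conditional(i):
    """Conditional distribution of eta given xi = x_i, by (21.4.1): p_ik / p_i."""
    p_i = sum(p[(i, k)] for k in y_vals)
    return {k: p[(i, k)] / p_i for k in y_vals}

cond = conditional(0)
assert sum(cond.values()) == 1                   # conditional masses sum to unity
cond_mean = sum(k * m for k, m in cond.items())  # conditional mean of eta
print(cond, cond_mean)
```

Here the conditional distribution of η given ξ = 0 places mass 3/4 at 0 and 1/4 at 1, so its conditional mean is 1/4, while the marginal mean of η is 1/2: conditioning has shifted the mass, as it must for dependent variables.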
If the (ξ, η)-distribution is interpreted in the usual way as a distribution of a unit of mass over the points (xᵢ, yₖ), the conditional distribution is obtained by choosing a fixed xᵢ and multiplying each mass situated on the vertical through the point ξ = xᵢ by the factor 1/pᵢ., so as to make the sum of all the multiplied masses equal to unity.

The conditional mean value of a function g(ξ, η), relative to the hypothesis ξ = xᵢ, is defined as the mean value of g(xᵢ, η) with respect to the conditional distribution of η defined by (21.4.1):

(21.4.2) E(g(ξ, η) | ξ = xᵢ) = Σₖ pᵢₖ g(xᵢ, yₖ)/pᵢ..

For g(ξ, η) = η we obtain the conditional mean of η, which is the ordinate of the centre of gravity of the mass situated on the vertical ξ = xᵢ:

(21.4.3) E(η | ξ = xᵢ) = Σₖ pᵢₖ yₖ / Σₖ pᵢₖ.

On the other hand, taking g(ξ, η) = (η − E(η | ξ = xᵢ))², we obtain the conditional variance of η.

The conditional distribution of ξ, relative to the hypothesis η = yₖ, and the corresponding conditional mean values, are defined by permutation of the variables in the expressions given above. In the particular case when ξ and η are independent, (21.1.4) shows that we have pᵢₖ = pᵢ. p.ₖ, and this gives us

(21.4.4) P(ξ = xᵢ | η = yₖ) = pᵢ. = P(ξ = xᵢ),
P(η = yₖ | ξ = xᵢ) = p.ₖ = P(η = yₖ),

in accordance with the general relations (14.4.2) and (14.4.3).

2. The continuous type. Let f(x, y) be the joint fr. f. of the variables ξ and η. Consider an interval (x, x + h) such that the mass situated in the vertical strip x < ξ < x + h, which represents the probability

P(x < ξ < x + h) = ∫_x^{x+h} ∫_{−∞}^∞ f(t, u) du dt,

is positive. The conditional probability of the event η ≦ y, relative to the hypothesis x < ξ < x + h, is then by (14.3.1)

P(η ≦ y | x < ξ < x + h) = P(x < ξ < x + h, η ≦ y)/P(x < ξ < x + h)
= ∫_x^{x+h} ∫_{−∞}^y f(t, u) du dt / ∫_x^{x+h} ∫_{−∞}^∞ f(t, u) du dt.

This is the d. f. corresponding to the conditional distribution of η, relative to the hypothesis x < ξ < x + h.
It is simply equal to the quantity of mass situated in the strip x < ξ < x + h and below the line η = y, divided by the total mass in the strip.

Let now h tend to zero. If the continuity conditions stated in connection with (21.1.5) are satisfied at the point x, and if the marginal fr. f. f₁(x) takes a positive value at the point x, it follows from (5.1.4) that the conditional d. f. tends to the limit

(21.4.5) lim_{h→0} P(η ≦ y | x < ξ < x + h) = ∫_{−∞}^y f(x, u) du / ∫_{−∞}^∞ f(x, u) du.

For fixed x, the limit is evidently a d. f. in y, and this will be called the conditional d. f. of η, relative to the hypothesis ξ = x.

When f(x, y) is continuous in y, the conditional d. f. may be differentiated with respect to y, and we obtain the corresponding conditional fr. f. of η:

(21.4.6) f(y | x) = f(x, y) / ∫_{−∞}^∞ f(x, u) du = f(x, y)/f₁(x).

The conditional mean value of a function g(ξ, η), relative to the hypothesis ξ = x, is in this case

E(g(ξ, η) | ξ = x) = ∫_{−∞}^∞ g(x, y) f(y | x) dy.

Multiplying by f₁(x) and integrating with respect to x, we obtain

(21.4.7) E g(ξ, η) = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dx dy = ∫_{−∞}^∞ E(g(ξ, η) | ξ = x) f₁(x) dx.

The conditional mean and the conditional variance of η are

(21.4.8) E(η | ξ = x) = m₂(x) = ∫_{−∞}^∞ y f(x, y) dy / ∫_{−∞}^∞ f(x, y) dy,

(21.4.9) D²(η | ξ = x) = ∫_{−∞}^∞ (y − m₂(x))² f(x, y) dy / ∫_{−∞}^∞ f(x, y) dy.

The point with the coordinates ξ = x, η = m₂(x) is the limit, for h → 0, of the centre of gravity of the mass in the strip x < ξ < x + h.

The conditional distribution of ξ for a given value of η, and the corresponding conditional mean values, are defined in a similar way. Thus e. g. the conditional fr. f. of ξ, relative to the hypothesis η = y, is

(21.4.10) f(x | y) = f(x, y) / ∫_{−∞}^∞ f(x, y) dx = f(x, y)/f₂(y) = f₁(x) f(y | x)/f₂(y),
of either Tariable is indepen- dent of the hypothesis made with respect to the other variable, and is identical with the fr. f . of the corresponding marginal distribution. Accordingly the conditional mean values for both variables agree with the mean values in the marginal distributions: (21.4.1 1) (y) = (;r) = 21.5. Regression, I. — Let ^ and rj be random variables with a joint distribution of the continuous type, and suppose that the cor- responding fr. f . f[x,y) satisfies the continuity conditions stated in connection with (21.1.5) for every x such that the marginal fr. f. /, (.r) is positive. According to the preceding paragraph, the conditional fr. f. /(y | x) given by (21.4.6) then represents the distribution of mass in an in- finitely narrow vertical strip through the point f = x. We may here think of ^ as an independent variable; to a fixed value ^=^x then corresponds a probability distribution of the dependent variable ri, with the£.f./(,|4 Consider now some typical value of this conditional i^-distribution, such as the mean, the mode, the median etc. Generally this value will depend on .r, and may thus be denoted by y^. As x varies, the point [x.yx] will describe a certain curve. From the shape of this curve we obtain information with respect to the location of the condi- tional Tj distribution for various values of (Cf fig. 22 a.) A curve of this type will be called a regression curve, and will be said to represent the regression of rj on In the sequel we shall al- ways, unless explicitly stated otherwise, choose for yx the conditional mean m^ix) of the variable rj, as given by (21.4.8), and so obtain the regression curve for the mean of rj as the locus of the point (x, when X varies: (21.5.1) y = m^i^x) = E(7;| f = x). If, instead of ^*, we consider t] as our independent variable, the conditional fr. f. of the dependent variable f for a fixed value rj = y is given by (21.4.10). 
Any typical value x,, of the conditional distri- bution of § gives rise to a regression curve representing the regression 270 21.S F;ig. 22. a) Regression of 1 / on b) Regression of | on of § on t]. (Cf fig. 22 b.) Thus the regression curve for the mean of ^ is the locus of the point (mj (y), y) when y varies, and lias the equation (21.5.2) a; = mi(.v) = E(||i2 = .v). The two regression curves (21.5.1) and (21.5.2) will in general not coincide. In many important cases occurring in the applications, both regression curves are straight or at least approximately straight lines. Thus e. g. in the particular case when § and rj are independent, it follows from (21.4.11) that the regression curves are straight lines parallel to the axes and passing through the centre of gravity (wi„ w,). — When a regression curve is a straight line, we shall say that we are concerned with a case of linear regression. The regression curves (21.5.1) and (21.5.2) possess an important minimum property. — Let us try to find, among all possible functions ^(1) of the single variable the particular function that gives the best possible representation or estimation of the other variable t}. Inter- preting the expression »be8t possible » in the sense of the least squares principle (cf 15.6), we then have to determine g{^) so as to render the expression (cf 21.4.7) E{ri- g (5)]* = / / [» — S' {x)]*f[x, y) dx dy — oo —00 (21.5.3) 00 00 = //i (^) jhf — g (a:)]V(y I ^) dy 271 21.5-6 as small as possible. By 15.4 the integral with respect to y in the last expression becomes, however, for every value ot x a, minimum when g(x) is equal to the conditional mean Thus the minimum of E[f} — among all possible functions ^(f), is attained for the func- tion g (^) == mg (^), which is graphically represented by the regression curve (21.5.1). — Similarly, the expression E[^ ^ h(r])Y attains its minimum for the function h (ij) = wij ( 17 ), which corresponds to the regression curve (21-5.2). 
Similar definitions may be introduced in the case of a distribution of the discrete type, as given by (21.1.1). For every value of such that the marginal probability Pf ^ is positive, the conditional distribution of rj is given by (21.4.1). Jjct us consider some typical value of this distribution, e. g. the conditional mean m(0 given by (21.4.8). When f assumes all possible values we thus obtain a sequence of points (x^, m^^)) representing the regression of on Conversely, the regression of ( on 17 is re- presented by the sequence of points where is the conditional mean of relative to the hypothesis 1 / = y^.. In either case, we may connect the points cor- responding to consecutive values of i or k by straight lines, and consider the curves thus formed as the regression curves of the discrete distribution. 21.6. Regression, II. — lu the literature, we often find the name of regression curves applied also to another type of curves than that introduced in the preceding paragraph. We shall now proceed to a discussion of this other type of curves. In the minimum problem considered in connection with (21.5.3), we tried to find, among all possible functions g (^), one that renders the mean value of the square [rj— g (?))* as small as possible, and we have seen that the solution of this problem is given by the regression curve (21.5.1). Instead of considering all possible functions ^^(5) we may, however, restrict ourselves to functions belonging to some given class, such as the class of all linear functions, all polynomials of a given degree w, etc. Thus we require to find, among all functions g ($) belonging to such a class, one that gives a best possible represen- tation of f] according to the principle of least squares. In such a case, the minimum problem may still have a definite solution, but this will generally correspond to a curve diflEerent from the re- gression curve (21.5.1). 
Curves obtained in this way will be denoted as mean square regression curves, or briefly m. sq. regression curves.¹)

¹) When the meaning is clear from the context, we shall often drop the »m. sq.».

The simplest case is that of the linear m. sq. regression. Here we propose to find the best linear estimate of η by means of ξ, i. e. the linear function g(ξ) = α + βξ that renders the mean value of the square (η − α − βξ)² as small as possible. Now we may write, using the notations introduced in 21.2, and assuming σ₁ > 0, σ₂ > 0,

(21.6.1) E(η − α − βξ)² = E((η − m₂) − β(ξ − m₁) + (m₂ − α − βm₁))²
= μ₀₂ − 2μ₁₁β + μ₂₀β² + (m₂ − α − βm₁)².

An easy calculation shows that the minimum problem has a unique solution given by

(21.6.2) β = β₂₁ = μ₁₁/μ₂₀ = ρσ₂/σ₁,  α = m₂ − βm₁,

where ρ is the correlation coefficient defined by (21.2.8). Thus the m. sq. regression line of η has the equation

(21.6.3) y = m₂ + ρ(σ₂/σ₁)(x − m₁).

The line passes through (m₁, m₂), and the equation may also be written

(21.6.4) (y − m₂)/σ₂ = ρ(x − m₁)/σ₁.

We note that this line is defined for any distribution such that both variances are finite and positive, and not as the regression curves of the preceding paragraph for distributions of the two simple types only.

The quantity β₂₁ defined by (21.6.2) is the regression coefficient of η on ξ. When the values of α and β given by (21.6.2) are introduced in (21.6.1), the latter expression assumes its minimum value

(21.6.5) E_min(η − α − βξ)² = μ₀₂ − μ₁₁²/μ₂₀ = σ₂²(1 − ρ²).

The expression E(η − α − βξ)² = ∫(y − α − βx)² dP may be considered as a weighted mean of the square of the vertical distance y − α − βx between a mass particle dP with the coordinates (x, y) and the straight line y = α + βx. Since this mean becomes a minimum for the regression line (21.6.4), this line may be called the line of closest fit to the mass in the distribution, when distances are measured along the axis of y, and the fit is judged according to the principle of least squares.

In the case of a distribution such that the regression curve y = m₂(x) as defined by (21.5.1) exists, the expression E(η − α − βξ)²
In the case of a distribution such that the regression curve $y = m_2(x)$ as defined by (21.5.1) exists, the expression $E(\eta - \alpha - \beta\xi)^2$ may be written in the form

$E(\eta - m_2(\xi))^2 + 2E\left[(\eta - m_2(\xi))(m_2(\xi) - \alpha - \beta\xi)\right] + E(m_2(\xi) - \alpha - \beta\xi)^2.$

By (21.4.7) and (21.4.8) the second term of this expression is, however, equal to zero. Thus we obtain for any $\alpha$ and $\beta$

(21.6.6) $E(\eta - \alpha - \beta\xi)^2 = E(\eta - m_2(\xi))^2 + E(m_2(\xi) - \alpha - \beta\xi)^2.$

Here, the first term in the second member is independent of $\alpha$ and $\beta$, so that the last term attains its minimum for the same values of $\alpha$ and $\beta$ as the first member, i. e. for the values given by (21.6.2). Since $m_2(x) - \alpha - \beta x$ is the vertical distance between the regression curve $y = m_2(x)$ and the line $y = \alpha + \beta x$, it is thus seen that the m. sq. regression line (21.6.4) may also be considered as the line of closest fit to the regression curve $y = m_2(x)$, distances always being measured along the axis of $y$. It immediately follows that, in a case when the regression curve $y = m_2(x)$ is a straight line, this is identical with the m. sq. regression line (21.6.4).

So far we have been concerned with the linear m. sq. regression of $\eta$ on $\xi$. In the converse case of the regression of $\xi$ on $\eta$, we have to find the values of $\alpha$ and $\beta$ that render the expression

(21.6.7) $E(\xi - \alpha - \beta\eta)^2 = \int (x - \alpha - \beta y)^2\,dP$

as small as possible. In the same way as above, we find that the problem has a unique solution, and that the minimizing straight line $x = \alpha + \beta y$ may be considered as the line of closest fit to the mass in the distribution, or to the regression curve $x = m_1(y)$, when distances are measured horizontally, i. e. along the axis of $x$. The equation of this line, the m. sq. regression line of $\xi$ on $\eta$, may be written

(21.6.8) $\frac{y - m_2}{\sigma_2} = \frac{1}{\varrho}\,\frac{x - m_1}{\sigma_1},$

and the regression coefficient has the expression

(21.6.9) $\beta_{12} = \frac{\mu_{11}}{\mu_{02}} = \varrho\,\frac{\sigma_1}{\sigma_2},$

while the corresponding minimum value of the expression (21.6.7) is

(21.6.10) $E_{\min}(\xi - \alpha - \beta\eta)^2 = \sigma_1^2(1 - \varrho^2).$

Fig. 23. M. sq.
regression lines, $m_1 = m_2 = 0$, $\sigma_1 = \sigma_2 = 1$. a) $\varrho > 0$, b) $\varrho < 0$.

Both m. sq. regression lines (21.6.4) and (21.6.8) pass through the centre of gravity $(m_1, m_2)$. The two lines can never coincide, except in the extreme cases $\varrho = \pm 1$, when the whole mass of the distribution is situated on a straight line (cf 21.2). Both regression lines then coincide with this line. When $\varrho = 0$, the equations of the m. sq. regression lines reduce to $y = m_2$ and $x = m_1$, so that the lines are then parallel with the axes. This case occurs e. g. when the variables $\xi$ and $\eta$ are independent (cf 21.2 and 21.7). If the variables are standardized by placing the origin in the centre of gravity and choosing $\sigma_1$ and $\sigma_2$ as units of measurement for $\xi$ and $\eta$ respectively, the equations of the m. sq. regression lines reduce to the simple form $y = \varrho x$ and $y = x/\varrho$. When $\varrho$ is neither zero nor $\pm 1$, these lines are disposed as shown by Fig. 23 a or 23 b, according as $\varrho > 0$ or $\varrho < 0$.

If, instead of measuring the distance between a point and a straight line in the direction of one of the coordinate axes, we consider the shortest, i. e. the orthogonal distance, we obtain a new type of regression lines. Let $d$ denote the shortest distance between the point $(\xi, \eta)$ and a straight line $L$. If $L$ is determined such that $E(d^2)$ becomes as small as possible, we obtain the orthogonal m. sq. regression line. This is the line of closest fit to the $(\xi, \eta)$-distribution, when distances are measured orthogonally. Now $E(d^2)$ may be considered as the moment of inertia of the mass in the distribution with respect to $L$. For a given direction of $L$, this always attains its minimum when $L$ passes through the centre of gravity. We may thus write the equation of $L$ in the form $(\xi - m_1)\sin\varphi - (\eta - m_2)\cos\varphi = 0$, where $\varphi$ is the angle between $L$ and the positive direction of the $x$-axis. The moment of inertia is then

$E(d^2) = E\left[(\xi - m_1)\sin\varphi - (\eta - m_2)\cos\varphi\right]^2 = \mu_{20}\sin^2\varphi - 2\mu_{11}\sin\varphi\cos\varphi + \mu_{02}\cos^2\varphi.$
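This expression for $E(d^2)$ can be minimized numerically; a sketch with assumed moment values (they are not taken from the text), checking against the principal-axis equation $\tan 2\varphi = 2\mu_{11}/(\mu_{20} - \mu_{02})$ quoted below:

```python
import math

# Illustrative second-order central moments (assumed values)
mu20, mu11, mu02 = 3.0, 1.2, 1.5

def inertia(phi):
    # E(d^2) as a function of the angle phi of the line L
    s, c = math.sin(phi), math.cos(phi)
    return mu20 * s * s - 2 * mu11 * s * c + mu02 * c * c

# Locate the minimizing angle by a fine scan over a half turn
phis = [math.pi * k / 100_000 for k in range(100_000)]
phi_min = min(phis, key=inertia)

# The minimizer satisfies tan(2*phi) = 2*mu11 / (mu20 - mu02)
phi_theory = 0.5 * math.atan2(2 * mu11, mu20 - mu02)

# Writing E(d^2) = (mu20 + mu02)/2 - R*cos(2*phi - delta) shows that
# the minimum value is (mu20 + mu02)/2 - R with R below
R = math.hypot((mu20 - mu02) / 2, mu11)
```

The angle found by the scan is the direction of the orthogonal m. sq. regression line for these moments.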
If, on each side of the centre of gravity, we mark on $L$ a segment of length inversely proportional to $\sqrt{E(d^2)}$, the locus of the end-points when $\varphi$ varies is an ellipse of inertia of the distribution. The equation of this ellipse is easily found to be

$\frac{(x - m_1)^2}{\sigma_1^2} - \frac{2\varrho(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \frac{(y - m_2)^2}{\sigma_2^2} = c^2.$

For various values of $c$ we obtain a family of homothetic ellipses with the common centre $(m_1, m_2)$. The directions of the principal axes of this family of ellipses are obtained from the equation

$\tan 2\varphi = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}},$

and the equations of the axes are

(21.6.11) $y - m_2 = \frac{\mu_{02} - \mu_{20} \pm \sqrt{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2}}{2\mu_{11}}\,(x - m_1).$

Here, the upper sign corresponds to the major axis of the ellipse and thus to the minimum of $E(d^2)$, i. e. to the orthogonal m. sq. regression line. In the case $\mu_{11} = \mu_{20} - \mu_{02} = 0$ the problem is undetermined; in all other cases there is a unique solution.

The parabolic m. sq. regression of order $n > 1$ forms a generalization of the linear m. sq. regression. We here propose to determine a polynomial $g(x) = \beta_0 + \beta_1 x + \cdots + \beta_n x^n$ such that the mean value $M = E(\eta - g(\xi))^2$ becomes as small as possible. The curve $y = g(x)$ is then the $n$:th order parabola of closest fit to the mass in the distribution, or to the regression curve $y = m_2(x)$. Assuming that all moments appearing in our formulae are finite, we obtain the conditions for a minimum:

$\tfrac{1}{2}\,\frac{\partial M}{\partial \beta_r} = -E\left[\xi^r(\eta - g(\xi))\right] = \beta_0\alpha_{r0} + \beta_1\alpha_{r+1,0} + \cdots + \beta_n\alpha_{r+n,0} - \alpha_{r1} = 0$

for $r = 0, 1, \ldots, n$. If the moments $\alpha_{jk}$ are known, we thus have $n + 1$ equations to determine the $n + 1$ unknowns $\beta_0, \ldots, \beta_n$. The calculations involved in the determination of the unknown coefficients may be much simplified, if the regression polynomial $g(x)$ is considered as a linear aggregate of the orthogonal polynomials $p_\nu(x)$ associated with the marginal distribution of $\xi$. For all orders such that these polynomials are uniquely determined (cf 12.6), we have

(21.6.12) $E(p_m(\xi)\,p_n(\xi)) = \int_{-\infty}^{\infty} p_m(x)\,p_n(x)\,dF_1(x) = \begin{cases} 1 & \text{for } m = n, \\ 0 & \text{for } m \neq n, \end{cases}$
where $p_n(x)$ is of the $n$:th degree, and $F_1(x)$ denotes the marginal d. f. of $\xi$. Any polynomial $g(x)$ of degree $n$ may be written in the form

$g(x) = c_0 p_0(x) + c_1 p_1(x) + \cdots + c_n p_n(x)$

with constant coefficients $c_\nu$. The conditions for a minimum now become

(21.6.13) $\tfrac{1}{2}\,\frac{\partial M}{\partial c_\nu} = -E\left[p_\nu(\xi)(\eta - g(\xi))\right] = c_\nu - E(\eta\,p_\nu(\xi)) = 0.$

Hence we obtain $c_\nu = E(\eta\,p_\nu(\xi))$, so that the coefficients are obtained directly, without first having to solve a system of linear equations. It is further seen that the expression for $c_\nu$ is independent of the degree $n$. Thus if we know e. g. the regression polynomial of degree $n$, and require the corresponding polynomial of degree $n + 1$, it is only necessary to calculate the additional term $c_{n+1}p_{n+1}(x)$. — Introducing the expressions of the $c_\nu$ into the mean value $M$, we find for the minimum value of $M$

(21.6.14) $E_{\min}(\eta - g(\xi))^2 = E(\eta^2) - c_0^2 - c_1^2 - \cdots - c_n^2.$

It should finally be observed that it is by no means essential for the validity of the above relations that the $p_\nu(x)$ are polynomials. Any sequence of functions satisfying the orthogonality conditions (21.6.12) may be used to form a m. sq. regression curve $y = g(x) = \sum c_\nu p_\nu(x)$, and the relations (21.6.13) and (21.6.14) then hold true irrespective of the form of the $p_\nu(x)$.

21.7. The correlation coefficient. — According to (21.2.8), the correlation coefficient $\varrho$ of $\xi$ and $\eta$ is defined by the expression

$\varrho = \frac{\mu_{11}}{\sigma_1\sigma_2} = \frac{E\left[(\xi - m_1)(\eta - m_2)\right]}{\sqrt{E(\xi - m_1)^2\,E(\eta - m_2)^2}},$

and we have seen in 21.2 that we always have $-1 \leq \varrho \leq 1$. The correlation coefficient is an important characteristic of the $(\xi, \eta)$-distribution. Its main properties are intimately connected with the two m. sq. regression lines

(21.7.1) $\frac{y - m_2}{\sigma_2} = \varrho\,\frac{x - m_1}{\sigma_1}, \qquad \frac{y - m_2}{\sigma_2} = \frac{1}{\varrho}\,\frac{x - m_1}{\sigma_1},$

which are the straight lines of closest fit to the mass in the $(\xi, \eta)$-distribution, in the sense defined in the preceding paragraph. The closeness of fit realized by these lines is measured by the expressions

(21.7.2) $\sigma_2^2(1 - \varrho^2) \quad \text{and} \quad \sigma_1^2(1 - \varrho^2),$

respectively.
Thus either variable has its variance reduced in the proportion $(1 - \varrho^2) : 1$ by the subtraction of its best linear estimate in terms of the other variable. These expressions are sometimes called the residual variances of $\eta$ and $\xi$ respectively. When $\varrho = 0$, no part of the variance of $\eta$ can thus be removed by the subtraction of a linear function of $\xi$, and vice versa. In this case, we shall say that the variables are uncorrelated.

When $\varrho \neq 0$, a certain fraction of the variance of $\eta$ may be removed by the subtraction of a linear function of $\xi$, and vice versa. The maximum amount of the reduction increases according to (21.7.2) in the same measure as $\varrho$ differs from zero. In this case, we shall say that the variables are correlated, and that the correlation is positive or negative according as $\varrho > 0$ or $\varrho < 0$. When $\varrho$ reaches one of its extreme values $\pm 1$, (21.7.2) shows that the residual variances are zero. We have shown in 21.2 that this case occurs when and only when the total mass of the $(\xi, \eta)$-distribution is situated on a straight line, which is then identical with both regression lines (21.7.1). In this extreme case, there is complete functional dependence between the variables; when $\xi$ is known, there is only one possible value for $\eta$, and conversely. Either variable is a linear function of the other, and the two variables vary in the same sense, or in inverse senses, according as $\varrho = +1$ or $\varrho = -1$.

On account of these properties, the correlation coefficient $\varrho$ may be regarded as a measure of the degree of linearity shown by the $(\xi, \eta)$-distribution. This degree reaches its maximum when $\varrho = \pm 1$ and the whole mass of the distribution is situated on a straight line. The opposite case occurs when $\varrho = 0$ and no reduction of the variance of either variable can be effected by the subtraction of a linear function of the other variable. It has been shown in 21.2 that in the particular case when $\xi$ and $\eta$ are independent we have $\varrho = 0$.
Thus two independent variables are always uncorrelated. It is most important to observe that the converse is not true. Two uncorrelated variables are not necessarily independent.

Consider, in fact, a one-dimensional fr. f. $g(x)$ which differs from zero only when $x > 0$, and has a finite second moment. Then

$f(x, y) = \frac{g\left(\sqrt{x^2 + y^2}\right)}{2\pi\sqrt{x^2 + y^2}}$

is the fr. f. of a two-dimensional distribution, where the density of the mass is constant on every circle $x^2 + y^2 = c^2$. The centre of gravity is $m_1 = m_2 = 0$, and on account of the symmetry of the distribution we have $\mu_{11} = 0$, and hence $\varrho = 0$. Thus two variables with this distribution are uncorrelated. However, in order that the variables should be independent, it is by (15.11.3) necessary and sufficient that $f(x, y)$ should be of the form $f_1(x)f_2(y)$, and this condition is not always satisfied, as will be seen e. g. by taking $g(x) = e^{-x}$.

If $\varrho$ is the correlation coefficient of $\xi$ and $\eta$, it follows directly from the definition that the variables $\xi' = a\xi + b$ and $\eta' = c\eta + d$ have the correlation coefficient $\varrho' = \varrho\,\operatorname{sgn}(ac)$, where $\operatorname{sgn} x$ stands for $\pm 1$, according as $x$ is positive or negative. In the particular case of a discrete distribution with only two possible values ($x_1, x_2$ and $y_1, y_2$ respectively) for each variable, we find after some reductions, using the notations of 21.1,

(21.7.3) $\varrho = \frac{p_{11}p_{22} - p_{12}p_{21}}{\sqrt{p_{1.}\,p_{2.}\,p_{.1}\,p_{.2}}}\;\operatorname{sgn}\left[(x_1 - x_2)(y_1 - y_2)\right].$

21.8. Linear transformation of variables. — Consider a linear transformation of the random variables $\xi$ and $\eta$, corresponding to a rotation of axes about the centre of gravity. We then introduce new variables $X$ and $Y$ defined by

(21.8.1) $X = (\xi - m_1)\cos\varphi + (\eta - m_2)\sin\varphi, \qquad Y = -(\xi - m_1)\sin\varphi + (\eta - m_2)\cos\varphi,$

and conversely

(21.8.2) $\xi = m_1 + X\cos\varphi - Y\sin\varphi, \qquad \eta = m_2 + X\sin\varphi + Y\cos\varphi.$
If the angle of rotation $\varphi$ is determined by the equation $\tan 2\varphi = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}}$, we find

$E(XY) = \mu_{11}\cos 2\varphi - \tfrac{1}{2}(\mu_{20} - \mu_{02})\sin 2\varphi = 0,$

so that $X$ and $Y$ are uncorrelated. In the particular case $\mu_{11} = \mu_{20} - \mu_{02} = 0$, when the equation for $\varphi$ is undetermined, we have $E(XY) = 0$ for any $\varphi$. Thus it is always possible to express $\xi$ and $\eta$ as linear functions of two uncorrelated variables.

Consider in particular the case when the moment matrix $M$ is of rank 1 (cf 21.2). We then have $\varrho = \pm 1$, and the whole mass of the distribution is situated on the line $\eta - m_2 = \varrho\,\frac{\sigma_2}{\sigma_1}\,(\xi - m_1)$. Let us now determine the angle of rotation $\varphi$ from the equation $\tan\varphi = \varrho\,\frac{\sigma_2}{\sigma_1}$. From (21.8.1) we then find

$E(Y^2) = \sigma_1^2\sin^2\varphi - 2\varrho\,\sigma_1\sigma_2\sin\varphi\cos\varphi + \sigma_2^2\cos^2\varphi = (\sigma_1\sin\varphi - \varrho\,\sigma_2\cos\varphi)^2 = 0.$

Thus the variance of $Y$ is equal to zero, so that $Y$ is a variable which is almost always equal to zero (cf 16.1). If we then put $Y = 0$ in (21.8.2), the resulting equations between $\xi$, $\eta$ and $X$ will be satisfied with a probability equal to 1. Thus two variables $\xi$ and $\eta$ with a moment matrix $M$ of rank 1 may, with a probability equal to 1, be expressed as linear functions of one single variable.

21.9. The correlation ratio and the mean square contingency. — Consider two variables $\xi$ and $\eta$ with a distribution of the continuous type, such that the conditional mean $m_2(x)$ is a continuous function of $x$. In the relation (21.6.6) we put $\alpha = m_2$, $\beta = 0$, and so obtain

(21.9.1) $\sigma_2^2 = E(\eta - m_2)^2 = E(\eta - m_2(\xi))^2 + E(m_2(\xi) - m_2)^2.$

We thus see that the variance of $\eta$ may be represented as the sum of two components, viz. the mean square deviation of $\eta$ from its conditional mean $m_2(\xi)$, and the mean square deviation of $m_2(\xi)$ from its mean $m_2$. We now define a quantity $\theta_{21}$ by putting

(21.9.2) $\theta_{21}^2 = \frac{E(m_2(\xi) - m_2)^2}{\sigma_2^2} = \frac{1}{\sigma_2^2}\int_{-\infty}^{\infty}(m_2(x) - m_2)^2\,f_1(x)\,dx.$

$\theta_{21}$ is the correlation ratio¹) of $\eta$ on $\xi$ introduced by K. Pearson. In the applications we are usually concerned with the square $\theta_{21}^2$, and we may thus leave the sign of $\theta_{21}$ undetermined.
From (21.9.1) we obtain

(21.9.3) $1 - \theta_{21}^2 = \frac{1}{\sigma_2^2}\,E(\eta - m_2(\xi))^2,$

and hence

(21.9.4) $0 \leq \theta_{21}^2 \leq 1.$

We further write the equation of the first m. sq. regression line (21.7.1) in the form $y = \alpha + \beta x$, and insert these values of $\alpha$ and $\beta$ in (21.6.6). Using (21.7.2) and (21.9.3), we then obtain after reduction

(21.9.5) $\theta_{21}^2 = \varrho^2 + \frac{1}{\sigma_2^2}\,E(m_2(\xi) - \alpha - \beta\xi)^2.$

It follows that $\theta_{21} = 0$ when and only when $m_2(x)$ is independent of $x$. In fact, when $m_2(x)$ is constant, the regression curve $y = m_2(x)$ is a horizontal straight line, which implies $\varrho = \beta = 0$, and consequently $\theta_{21} = 0$. The converse is shown in a similar way. — Further, (21.9.3) shows that $\theta_{21}^2 = 1$ when and only when the whole mass of the distribution is situated on the regression curve $y = m_2(x)$, so that there is complete functional dependence between the variables. For intermediate values of $\theta_{21}^2$, (21.9.3) shows that the correlation ratio may be considered as a measure of the tendency of the mass to accumulate about the regression curve. When the regression of $\eta$ on $\xi$ is linear, so that $y = m_2(x)$ is a straight line, (21.9.5) shows that we have $\theta_{21}^2 = \varrho^2$, and (21.9.3) reduces to the first relation (21.7.2). In such a case, the calculation of the correlation ratio does not give us any new information, if we already know the correlation coefficient $\varrho$. In a case of non-linear regression, on the other hand, $\theta_{21}^2$ always exceeds $\varrho^2$ by a quantity which measures the deviation of the curve $y = m_2(x)$ from the straight line of closest fit.

The correlation ratio $\theta_{12}$ of $\xi$ on $\eta$ is, of course, defined by interchanging the variables in the above relations. The curve $y = m_2(x)$ is then replaced by the curve $x = m_1(y)$. For a distribution of the discrete type, the correlation ratio may be similarly defined, replacing (21.9.2) and (21.9.3) by

¹) In the literature, the correlation ratio is usually denoted by the letter $\eta$, which obviously cannot be used here, since $\eta$ is a random variable.

(21.9.2 a) $\theta_{21}^2 = \frac{1}{\sigma_2^2}\sum_i p_{i.}\,(m^{(i)} - m_2)^2,$

(21.9.3 a) $1 - \theta_{21}^2$
$= \frac{1}{\sigma_2^2}\,E\left(\eta - m^{(i)}\right)^2,$

where $p_{i.}$ and $m^{(i)}$ are defined by (21.1.2) and (21.4.3) respectively. The relations (21.9.4), (21.9.5) and the above conclusions concerning the properties of the correlation ratio hold true with obvious modifications in this case.

The correlation coefficient and the correlation ratio both serve to characterize, in the sense explained above, the degree of dependence between two variables. Many other measures have been proposed for the same purpose. We shall here only mention the mean square contingency introduced by K. Pearson. Consider two variables $\xi$ and $\eta$ with a distribution of the discrete type as defined by (21.1.1), and suppose that the number of possible values is finite for both variables. The probabilities $p_{ik}$ then form a matrix with, say, $m$ rows and $n$ columns. Since any row or column consisting exclusively of zeros may be discarded, we may suppose that every row and every column contains at least one positive element, so that the row sums $p_{i.}$ and the column sums $p_{.k}$ are all positive. The mean square contingency of the distribution is then

(21.9.6) $\varphi^2 = \sum_{i,k}\frac{(p_{ik} - p_{i.}\,p_{.k})^2}{p_{i.}\,p_{.k}} = \sum_{i,k}\frac{p_{ik}^2}{p_{i.}\,p_{.k}} - 1.$

By (21.1.4), $\varphi^2 = 0$ when and only when the variables are independent. On the other hand, by means of the inequalities $p_{ik} \leq p_{i.}$ and $p_{ik} \leq p_{.k}$, it follows from the last expression that $\varphi^2 \leq q - 1$, where $q = \min(m, n)$ denotes the smaller of the numbers $m$ and $n$, or their common value if both are equal. Further, the sign of equality holds in the last relation if and only if one of the variables is a uniquely determined function of the other. Thus $0 \leq \frac{\varphi^2}{q-1} \leq 1$, and the quantity $\frac{\varphi^2}{q-1}$ may be used as a measure, on a standardized scale, of the degree of dependence between the variables. In the particular case $m = n = 2$, we obtain after reduction

(21.9.7) $\varphi^2 = \frac{(p_{11}\,p_{22} - p_{12}\,p_{21})^2}{p_{1.}\,p_{2.}\,p_{.1}\,p_{.2}}.$
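The mean square contingency (21.9.6) is easily computed from a probability matrix; a sketch with assumed probabilities (not from the text), checking that it vanishes for an independent table and that the standardized quantity lies between 0 and 1:

```python
# Mean square contingency (21.9.6) for a finite discrete distribution,
# with illustrative probabilities (assumed values, not from the text).
p = [[0.20, 0.10, 0.05],
     [0.05, 0.25, 0.05],
     [0.05, 0.05, 0.20]]
m, n = len(p), len(p[0])
pi_ = [sum(row) for row in p]                              # row sums p_i.
p_k = [sum(p[i][k] for i in range(m)) for k in range(n)]   # column sums p_.k

phi2 = sum(p[i][k] ** 2 / (pi_[i] * p_k[k])
           for i in range(m) for k in range(n)) - 1

q = min(m, n)
standardized = phi2 / (q - 1)   # on the standardized scale of the text

# For an independent table p_ik = p_i. * p_.k the contingency vanishes
ind = [[pi_[i] * p_k[k] for k in range(n)] for i in range(m)]
phi2_ind = sum(ind[i][k] ** 2 / (pi_[i] * p_k[k])
               for i in range(m) for k in range(n)) - 1
```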
We have here = 2, so that is identical with qp*. q — 1 ^ Further, gp® assumes its maximum value 1 only in the two cases Pi 2 =Pii = 0 or jpu == == 0. 21.10. The ellipse of concentration. — Consider a one-dimensional random variable ^ with the mean m and the s. d. a. If is another variable which is uniformly distributed (cf 19.1) over the interval (m — al^3, m + a 13), it is easily seen that has the same mean and s. d. as Thus the interval (??i — a Vs, m + aVs) may be taken as a geometrical representation of the concentration of the ^-distribution about its centre of gravity m (cf also 15.6). We now propose to find an analogous geometrical representation of the concentration of a given two-dimensional distribution about its centre of gravity (m,, For this purpose, we want to find a curve enclosing the point (mj, such that, if a mass unit is uniformly distributed over the area bounded by the curve, this distribution will have the same first and second order moments as the given distribu- tion. (By a »uniform distribution" we mean, of course, a distribution with a constant fr. f .) In this general form, the problem is obviously undetermined, and we shall restrict ourselves to finding an ellipse having the required property. In order to simplify the writing, we may suppose = nu = 0. Let the second order central moments of the given distribution be i“ 2 o» Pn i“o 2 - We shall suppose that we have 1, so that our distribution does not belong to the extreme type that has its total mass situated on a straight line. Consider the non-negative quadratic form By (11.12.3) the area enclosed by the ellipse (/ == c* is 7tc^/\ A, where -4 = ^ 11^52 — - If a mass unit is uniformly distributed over this area, the first order moments of the distribution will evidently be zero, while the second order moments are according to (11.12.4) 4 ■ A ’ 4 ' ^ and 4 A ■ It is required to determine c and the a * a such that these moments 283 21.10 C / / Fig. 24. 
Concentration ellipse and regression lines, $\varrho > 0$. $O$ = centre of gravity. $OA$ = orthogonal m. sq. regression line. $OB$ = m. sq. regression line, $\eta$ on $\xi$. $OC$ = m. sq. regression line, $\xi$ on $\eta$.

coincide with $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$ respectively. It is readily seen that this is effected by taking $c^2 = 4$, and

$a_{11} = \frac{\mu_{02}}{M}, \qquad a_{12} = -\frac{\mu_{11}}{M}, \qquad a_{22} = \frac{\mu_{20}}{M},$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2$. It will be seen that the form $q(\xi, \eta)$ thus obtained is the reciprocal (cf 11.7) of the form $Q(\xi, \eta) = \mu_{20}\xi^2 + 2\mu_{11}\xi\eta + \mu_{02}\eta^2$. Returning to the general case of an arbitrary centre of gravity, and replacing the $\mu_{ik}$ by their expressions in terms of $\sigma_1$, $\sigma_2$ and $\varrho$, it thus follows that a uniform distribution of a mass unit over the area enclosed by the ellipse

(21.10.1) $\frac{1}{1 - \varrho^2}\left(\frac{(x - m_1)^2}{\sigma_1^2} - \frac{2\varrho(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \frac{(y - m_2)^2}{\sigma_2^2}\right) = 4$

has the same first and second order moments as the given distribution. — This ellipse will be called the ellipse of concentration corresponding to the given distribution. The domain enclosed by the ellipse (21.10.1) may thus be regarded as a two-dimensional analogue of the interval $(m - \sigma\sqrt{3},\ m + \sigma\sqrt{3})$. When two distributions in $R_2$ with the same centre of gravity are such that one of the concentration ellipses lies wholly within the other, the former distribution will be said to have a greater concentration than the latter. This concept will find an important use in the theory of estimation (cf 32.7).

If we replace the constant 4 in the equation (21.10.1) by an arbitrary constant $c^2$, we obtain for various values of $c^2$ a family of homothetic ellipses with the common centre $(m_1, m_2)$, which is identical with the family of ellipses of inertia considered in 21.6. The common major axis of the ellipses coincides with the orthogonal m. sq. regression line of the distribution (cf 21.6). The ordinary m. sq. regression lines are diameters of the ellipses, each of which is conjugate to one of the coordinate axes. The situation is illustrated by Fig. 24.

21.11. Addition of independent variables.
— Consider the two-dimensional random variables $x_1 = (\xi_1, \eta_1)$ and $x_2 = (\xi_2, \eta_2)$. We define the sum $x = x_1 + x_2$ according to the rules of vector addition:

$x = (\xi, \eta) = (\xi_1 + \xi_2,\ \eta_1 + \eta_2).$

By 14.5, $x$ is a two-dimensional random variable with a distribution uniquely determined by the simultaneous distribution of $x_1$ and $x_2$. Let us now suppose that $x_1$ and $x_2$ are independent variables according to the definition of 14.4, and denote by $\varphi(t, u)$, $\varphi_1(t, u)$ and $\varphi_2(t, u)$ the c. f:s of $x$, $x_1$ and $x_2$ respectively. By the theorem (15.3.4) on the mean value of a product of independent variables we then have

(21.11.1) $\varphi(t, u) = E\left(e^{i(t\xi + u\eta)}\right) = E\left(e^{i(t\xi_1 + u\eta_1)} \cdot e^{i(t\xi_2 + u\eta_2)}\right) = \varphi_1(t, u)\,\varphi_2(t, u).$

The generalization to an arbitrary number of terms is evident, and we thus obtain the same theorem as for one-dimensional variables (cf 15.12): The c. f. of a sum of independent variables is the product of the c. f:s of the terms.

We shall now consider the case of a sum $x = x_1 + x_2 + \cdots + x_n$, where the $x_\nu = (\xi_\nu, \eta_\nu)$ are independent variables all having the same two-dimensional distribution. We shall suppose that this latter distribution has finite moments of the second order $\mu_{20}$, $\mu_{11}$, $\mu_{02}$, while the first order moments are zero: $m_1 = m_2 = 0$. If $\varphi(t, u)$ is the c. f. of this common distribution of the $x_\nu$, we have by (21.3.3)

(21.11.2) $\varphi(t, u) = 1 - \tfrac{1}{2}\left(\mu_{20}t^2 + 2\mu_{11}tu + \mu_{02}u^2\right) + o(t^2 + u^2).$

On the other hand, we have $x = (\xi_1 + \cdots + \xi_n,\ \eta_1 + \cdots + \eta_n)$ and

$\frac{x}{\sqrt{n}} = \left(\frac{\xi_1 + \cdots + \xi_n}{\sqrt{n}},\ \frac{\eta_1 + \cdots + \eta_n}{\sqrt{n}}\right).$

If $\varphi_n(t, u)$ is the c. f. of $x/\sqrt{n}$, it thus follows from the above that we have $\varphi_n(t, u) = \left[\varphi\left(\frac{t}{\sqrt{n}}, \frac{u}{\sqrt{n}}\right)\right]^n$. Substituting in (21.11.2) $t/\sqrt{n}$ and $u/\sqrt{n}$ for $t$ and $u$, we obtain

$\varphi_n(t, u) = \left[1 - \frac{\mu_{20}t^2 + 2\mu_{11}tu + \mu_{02}u^2}{2n} + \frac{\delta(n, t, u)}{n}\right]^n,$

where, for any fixed $t$ and $u$, the quantity $\delta(n, t, u)$ tends to zero as $n \to \infty$. Hence we obtain, in the same way as in the proof of the Lindeberg-Lévy theorem in 17.4,

(21.11.3) $\lim_{n \to \infty} \varphi_n(t, u) = e^{-\frac{1}{2}\left(\mu_{20}t^2 + 2\mu_{11}tu + \mu_{02}u^2\right)}.$

Thus $\varphi_n(t, u)$ tends for all $t$ and $u$ to a limit which is obviously continuous for $(t, u) = (0, 0)$. By the continuity theorem for c. f:s proved in 10.7, we may then assert that this limit is the c. f.
of a certain distribution which in its turn is the limit, for $n \to \infty$, of the distribution of the variable $x/\sqrt{n}$. Thus if $x_1, x_2, \ldots$ are independent two-dimensional variables, all having the same distribution with finite second order moments and first order moments equal to zero, the distribution of the variable $\frac{x_1 + \cdots + x_n}{\sqrt{n}}$ always tends to a limiting distribution as $n \to \infty$, and the c. f. of the limiting distribution is given by the second member of (21.11.3). — Except for the trivial restriction $m_1 = m_2 = 0$, this is the two-dimensional generalization of the Lindeberg-Lévy theorem of 17.4.

It should be observed that, with respect to the second order moments, we have here only assumed that these are finite. Now, given any quantities $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$ such that the quadratic form $\mu_{20}t^2 + 2\mu_{11}tu + \mu_{02}u^2$ is non-negative, it is possible (cf 21.2) to find a distribution with $m_1 = m_2 = 0$ and the given quantities for its second order moments. Taking this distribution as the common distribution of the $x_\nu$ in the above theorem, it follows that the expression in the second member of (21.11.3) is always the c. f. of a certain distribution, as soon as the quadratic form within the brackets is non-negative. If $x$ is a variable having this c. f., and if $m = (m_1, m_2)$ is a constant vector, the variable $m + x$ has the c. f.

(21.11.4) $e^{i(m_1 t + m_2 u) - \frac{1}{2}\left(\mu_{20}t^2 + 2\mu_{11}tu + \mu_{02}u^2\right)}.$

The distribution corresponding to this c. f. is the two-dimensional normal distribution, which will be further discussed in the following paragraph.

21.12. The normal distribution. — We now proceed to study the distribution corresponding to the c. f. (21.11.4). We shall have to distinguish two cases according as the non-negative quadratic form $Q(t, u) = \mu_{20}t^2 + 2\mu_{11}tu + \mu_{02}u^2$ is definite or semi-definite positive (cf 11.10).
In the former case, we shall say that we are concerned with a non-singular normal distribution, whereas in the latter case we have a singular normal distribution. When we use the expression normal distribution without specification, it will always be understood that we include both kinds of distributions.

We shall first consider the case of a definite positive form $Q(t, u)$. Then the reciprocal form exists and has the expression (cf 21.10)

$Q^{-1}(x, y) = \frac{1}{1 - \varrho^2}\left(\frac{x^2}{\sigma_1^2} - \frac{2\varrho xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right),$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2 = \sigma_1^2\sigma_2^2(1 - \varrho^2) > 0$. From (11.12.1 b) we now obtain

$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{i(tx + uy) - \frac{1}{2}Q^{-1}(x, y)}\,dx\,dy = 2\pi\sqrt{M}\,e^{-\frac{1}{2}Q(t, u)},$

or, substituting $x - m_1$ for $x$ and $y - m_2$ for $y$,

$\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \varrho^2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{i(tx + uy) - \frac{1}{2}Q^{-1}(x - m_1,\ y - m_2)}\,dx\,dy = e^{i(m_1 t + m_2 u) - \frac{1}{2}Q(t, u)}.$

The last relation shows that the function

(21.12.1) $f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \varrho^2}}\,e^{-\frac{1}{2(1 - \varrho^2)}\left(\frac{(x - m_1)^2}{\sigma_1^2} - \frac{2\varrho(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \frac{(y - m_2)^2}{\sigma_2^2}\right)}$
For p == 0, ~ I'te ellipses are circles. As p approaches +1 or — 1, the ellipses become thin and needle-shaped, thus showing the tendency of the mass to accumulate towards the common major axis of the el- lipses, which is the orthogonal m. sq. regression line (cf 21.6) of the distribution. A variable (5, i?) with the fr.f. (21.12.1) is said to possess a non- singular normal distribution. The c. f. of the marginal distribution of i is then by (21.3.4) q>[t, 0) = Thus by 17.2 ^ is normal (»«i, Oj), with the marginal fr.f. 288 21.12 J\ (a:) = (ar-tHi)* By index permutation we obtain the corresponding expression for the marginal fr. £. (y) of rj. In the particular case when p = 0, it is seen that we have —fi{^)fi{y)y which implies that the variables are independent. For the normal distribution, it is thus legitimate to assert that two non-correlated variables are independent, though we have seen in 21 .7 that for a general distribution this may be untrue. The conditional fr. f. of ij, relative to the hypothesis ^ — x, is by (21.4.6) (21.12.4) /(.v|x) = f^',y) ^ 1 ^ f\ (•^') Oj 1^2 7r ( 1 — p*) This is a normal fr. f. in y, with the mean (x) = — Wj) and the s. d. I — p*. Thus the regression of t] on ^ is linear, and the conditional variance of rj is independent of the value assumed by — The analogous properties of the conditional distribution of § for a given value of 7] are deduced in the same way. When the non-negative form Q (t, u) is semi-defimte, the determinant M is zero, and no reciprocal form exists (cf 11.7 and 11.10). It fol- lows, however, from the preceding paragraph that the expression (21.12.2) is still the c. f . of a certain distribution, and this will be called a singular normal distribution. By 21.2, the total mass of this distribution is situated in a single point or on a straight line, ac- cording as the rank of the moment matrix M is 0 or 1. In such a case, it is evident that no finite two-dimensional fr. f . exists. 
Still, a singular normal distribution may always be regarded as the limit of a sequence of non-singular normal distributions. In order to see this, we may consider the sequence of non-singular nor- mal distributions corresponding to the given values of nii and w*, and the sequence of definite positive forms Q^(t,u)= Q[t,u) + el(t^ + «**), where Bv 0. The corresponding c. f:s tend, of course, to the limit (21.12.2) , and by the continuity theorem 10.7 the non-singular distri- butions then tend to the given singular distribution. 289 19—464 H.Cramir 21.12 Consider a singular normal distribution with a moment matrix M of rank 1. By 21.8, the corresponding variables ^ and rj may, with a probability equal to 1, be represented as linear functions of a single variable X. Conversely, X is a linear function of § and rj, and the c. f. of X is then of the form so that X is normally distri- buted. The case when M is of rank 0 may be regarded as the limiting case a = 0, and we thus have the following result: A two-dimensional singular normal distribution may he regarded as an ordinary one-dimensional normal distribution on a certain straight line in the plane. When m, — — 0, we obtain from (12.6.8) the following expansion of the nor- mal fr. f. in powers of q: ( 21 . 12 . 6 ^ . V -- - Ox (Ji 0 The series may be integrated term by term, and we dednee a corresponding expression for the normal d, f. ( 21 . 12 . 6 ) ;r //.. 00 \ \ / = v . >./ W, "L 0 rl — 00 — 00 For j‘ = y — 0 we obtain from (21.12.6) V ^ L_ o' and hence by integration with respect to p “ [(P*”’ (O')]* „ 1 / dr 1 V ^ - -- I _ = — arc sin o. " vl 27t{ Vi—r* Now (21.12.6) gives 0 0 j f f du dv = i “ arc sin p. — GO —00 By the symmetry properties of the fr. f. /(x, y), it then follows that in each of the first and third quadrants of the (r, ^)-plane we have the mass i + ^ arc sin p, 2 7t while each of the second and fonrth quadrants contains the mass ^ arc sin p. These relations are due to Stieltjes, Kef. 
220, and Sheppard, Ref. 211. 290 22.1 CHAPTER 22. General Properties op Distributions in J?„. 22.1.‘ Two simple types of distributions. Conditional distributions. — The joint probability distribution (cf 14.2) of n one-dimensional random variables . . . , is a distribution in the ^?-dimen8ional space Rn, with the variable point « = (J, , . . . , The probability function (cf 8.4) of the distribution is a set func- tion P{S) = P{x<i S), which for any set S in Rn represents the prob- ability of the relation x <. S. The distribution function^ on the other hand, is a function of n real variables defined by the relation (8.3.1): F{Xy , . . . , aJn) = P(?, ^ Xj , . . . , ^ Xn). The distribution is uniquely defined by either function P or F. As before, we shall make a frequent use of our mechanical illustra- tion, interpreting the probability distribution by means of a distribu- tion of a unit of mass over Hn. If we pick out a group of k variables , . . ., and project the mass in the original ? 2 -dimensional distri- bution on the X- dimensional subspace of these variables, we obtain (cf 8.4) the k-dimensional marginal distribution of . . . , The corresponding marginal d. f. is obtained, as in the two-dimensional case, by putting the v — k remaining variables in F equal to -f qo . Thus in particular the marginal d. f. of the single variable is Fi[x) = P(a:, oo, . . oo), and similarly for any As in the cases « = 1 and w = 2 (cf 15.2 and 21.1), we now intro- duce the two simple types of distributions: the discrete and the cow- tinuous type. The definitions and properties of these are directly analogous to those given in 21.1, and we shall here only add some brief comments. For a distribution of the discrete type, we have on the axis of each fv a finite or enumerable set of points Xviy which are the discrete mass points of the marginal distribution of ^v. The total mass of the w-dimensional distribution of * = (^j, . . 
., ξₙ) is then concentrated in the discrete points (x_{1ν₁}, …, x_{nνₙ}), each of these points carrying a mass p_{ν₁…νₙ} ≥ 0, so that

P(ξ₁ = x_{1ν₁}, …, ξₙ = x_{nνₙ}) = p_{ν₁…νₙ}.

The marginal distribution of any group of k variables is also of the discrete type, and the corresponding p:s are obtained in a similar way as in (21.1.2) and (21.1.3), by summing p_{ν₁…νₙ} over all values of the n − k remaining variables. For a distribution of the continuous type, the d.f. F is everywhere continuous, and the probability density or frequency function (cf 8.4)

f(x₁, …, xₙ) = ∂ⁿF / (∂x₁ ⋯ ∂xₙ)

exists and is continuous everywhere, except possibly in certain points belonging to a finite number of hypersurfaces in Rₙ. The differential f(x₁, …, xₙ) dx₁ ⋯ dxₙ will be called the probability element (cf 15.1) of the distribution. The fr.f. of the marginal distribution of any group of k variables is obtained by integrating f(x₁, …, xₙ) with respect to the n − k remaining variables, as shown for the two-dimensional case by (21.1.5) and (21.1.6). When ξ₁, …, ξₙ have a distribution of the continuous type, the conditional fr.f. of ξ₁, …, ξ_k relative to the hypothesis ξ_{k+1} = x_{k+1}, …, ξₙ = xₙ is given by the expression generalizing (21.4.10):

(22.1.1)   f(x₁, …, x_k | x_{k+1}, …, xₙ) = f(x₁, …, xₙ) / ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(x₁, …, xₙ) dx₁ ⋯ dx_k.

Finally, let us consider two variables x = (ξ₁, …, ξ_m) and y = (η₁, …, ηₙ) such that the (m + n)-dimensional combined variable (x, y) has a distribution of the continuous type. In generalization of (21.1.7) we then find that a necessary and sufficient condition for the independence of x and y is

(22.1.2)   f(x₁, …, x_m, y₁, …, yₙ) = f₁(x₁, …, x_m) f₂(y₁, …, yₙ),

where f, f₁ and f₂ are the fr.f:s of (x, y), x and y respectively. The generalization to any number of variables x, y, … is immediate. 22.2. Change of variables in a continuous distribution. — Let x = (ξ₁, …, ξₙ) be a random variable in Rₙ, and consider the m functions (22.2.
1 ) • ••.?»), (* = 1, 2, . . ., m), 292 22.2 where m is not necessarily equal to According to 14.5, the vector y = (i;i , . . . , fjm) then constitutes a random variable in a space Rm of m dimensions, with a probability distribution uniquely determined by the distribution of x. We shall here only consider the particular case when m = 77 , and the x-distribution belongs to the continuous type. If the functions Qi satisfy certain conditions, the yr-distribution may then be explicitly determined, as we are now going to show. Let us assume that the following conditions A) and B) are satisfied for all X such that the fr. f. . . .,u:») is different from zero: A) The functions (ji are everywhere unique and continuous, and have continuous partial derivatives in all points x, except possibly in certain points belonging to a finite number of hypersurfaces. B) The relations (22.2.1), where we now take rn = ??, define a one- to-one correspondence between the points x == . . . , and y == ‘ so that we have conversely = In {r]iy . . r]n) for 7 = 1, . . ., a, where the //* are unique. Consider a j)oint x which does not belong to any of the exceptional hypersurfaces, and is such that the Jacobian ’..W • 1 sii) d fji is different from zero. The Jacobian of the inverse transformation, to X, since we have t \ >^ Tt ) Vn) <) Vk is then finite in the point y corresponding (^1 , . . . , r;,,) _ ( g|, ■ ■ ■ , ;») When S is a sufficiently small neighbourhood of x, and T is the corresponding set in the y-space, J is finite for all points of 7\ and we have (22.2.2) P{S)=- //(x, , . . . , x„) (lx, . . . c/.r„=j7(.r, , . . . , ./•„) | J| dy , . . . dy,, S T where in the last integral the r, should be replaced by their expres- sions Xi = hi{y^, . . ., //„) in terms of the //,. The probability element of the x distribution is thus transformed according to the relation (22.2.3) /(.J'l, • • .,x„)<lx^ . . . dx„ . . .,x„) 1 J) c///, . . . dy„, 293 22.2-3 where in the second member Xi — //«). The fr. 
f. of the new variable y = (r]i, . . rjn) is thus , . . . , .r„) | J\. When /? = 1 , and the transformation r; — g (i) or £ — A [rj) is unique in both senses, (22.2.3) reduces to f{x) tix =/[// (;?/)] I II M I (/if, where the coefficient of dg is the fr. f. of the variable rj. An example of this relation is given by the expression (If). 1.2), which is related fi — h to the linear transformation in — a £ + A, or £ — a Suppose DOW that the conditiou B is not satisfied. To each point Jc, there still eorrespond.s one and only one point y, but the converse transformation is not unique to a given y there may correspond more than one x. We then have to divide the *-space in several parts, so that in each part the correspondence is unique in both senses. The mass carried by a set T in the y-space will then he e(|ual to the sum of the contributions arising from the corresjmnding sets in the various parts of the x-space. Each contribution is represented by a multiple integral that may be trans- formed according to (22.2.2), and it thus follows that the fr. f. of y now assumes the form 1 .7^ |, w'here the sum is extended over the various points x corresponding to a given y, and and are the corresponding values of / and J. In the case /? = 1, an example of thi.s type is afforded by the transformation considered in 16.1. The expression (16.1.4) for the fr. f. is evidently a special case of the general expression |./, |. — A more complicated example will oc.cnr in 29.3. 22.3. Mean values, moments. — The mean value of a function • • - 7 ?«) integrable over Rn with respect to the ^/ dimensional pr. f. P{S) has been defined in (15.3.2) by the integral . • •. ?") = / r„) dP. The moments of the distribution (cf 9.2 and 21.2) are the mean values (22.3. 1 ) i';" ) == j xi ' . . . d P, where + Vn is the order of the moment. For the first order moments we shall use the notation = - f x,dP. K 294 22.3 The point m = (m^ , . . . 
, mₙ) is the centre of gravity of the mass in the n-dimensional distribution. The central moments, or the moments about the point m, are obtained by replacing in (22.3.1) each power ξᵢ^{νᵢ} by (ξᵢ − mᵢ)^{νᵢ}. The second order central moments play an important part in the sequel, and whenever nothing is explicitly said to the contrary, we shall always assume that these are finite. The use of the μ notation for these moments would be somewhat awkward when n > 2, owing to the large number of subscripts required. In order to simplify the writing, we shall find it convenient to introduce a particular notation, putting

(22.3.2)   λᵢᵢ = σᵢ² = E(ξᵢ − mᵢ)²,   λᵢₖ = ρᵢₖ σᵢ σₖ = E(ξᵢ − mᵢ)(ξₖ − mₖ).

Thus λᵢᵢ denotes the variance and σᵢ the s.d. of the variable ξᵢ, while λᵢₖ denotes the covariance of ξᵢ and ξₖ. The correlation coefficient

ρᵢₖ = λᵢₖ / (σᵢ σₖ)

is, of course, defined only when σᵢ and σₖ are both positive. Obviously we have λᵢₖ = λₖᵢ, ρₖᵢ = ρᵢₖ and ρᵢᵢ = 1. — In the particular case n = 2, we have λ₁₁ = μ₂₀, λ₁₂ = μ₁₁, λ₂₂ = μ₀₂. In generalization of (21.2.5), we find that the mean value

(22.3.3)   E(Σᵢ tᵢ(ξᵢ − mᵢ))² = Σᵢ,ₖ λᵢₖ tᵢ tₖ

is never negative, so that the second member is a non-negative quadratic form in t₁, …, tₙ. The matrix of this form is the moment matrix

Λ = {λᵢₖ},   (i, k = 1, 2, …, n),

while the form obtained by the substitution tᵢ = uᵢ/σᵢ corresponds to the correlation matrix

P = {ρᵢₖ},

which is defined as soon as all the σᵢ are positive. Thus the symmetric matrices Λ and P are both non-negative (cf 11.10). Between Λ and P we have the relation Λ = Σ P Σ, where Σ denotes the diagonal matrix formed with σ₁, …, σₙ as its diagonal elements. By 11.6, it then follows that Λ and P have the same rank. For the corresponding determinants Λ = |λᵢₖ| and P = |ρᵢₖ| we have Λ = σ₁² ⋯ σₙ² P. From (11.10.3) we obtain

(22.3.4)   0 ≤ Λ ≤ λ₁₁ λ₂₂ ⋯ λₙₙ,   0 ≤ P ≤ ρ₁₁ ρ₂₂ ⋯ ρₙₙ = 1.

In the particular case when λᵢₖ = 0 for i ≠ k, we shall say that the variables ξ₁, …, ξₙ are uncorrelated. The moment matrix Λ is then a diagonal matrix, and Λ = λ₁₁ λ₂₂ ⋯ λₙₙ.
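As a numerical aside (not part of the original text), the matrix relations just stated — Λ = Σ P Σ, the equality of rank of Λ and P, and the determinant relation Λ = σ₁² ⋯ σₙ² P — are easy to verify for a sample moment matrix. A minimal sketch, assuming the Python numpy library:

```python
import numpy as np

# Sample check of the relations of 22.3: Lambda = Sigma P Sigma, where
# Sigma = diag(sigma_1, ..., sigma_n), and det Lambda = sigma_1^2...sigma_n^2 det P.
rng = np.random.default_rng(1)
x = rng.standard_normal((5_000, 3)) @ np.array([[1.0, 0.4, 0.0],
                                                [0.0, 1.0, 0.7],
                                                [0.0, 0.0, 1.0]])
Lam = np.cov(x, rowvar=False)            # moment matrix (second central moments)
sig = np.sqrt(np.diag(Lam))              # standard deviations sigma_i
P = Lam / np.outer(sig, sig)             # correlation matrix
assert np.allclose(Lam, np.diag(sig) @ P @ np.diag(sig))
assert np.isclose(np.linalg.det(Lam), np.prod(sig**2) * np.linalg.det(P))
assert np.linalg.matrix_rank(Lam) == np.linalg.matrix_rank(P)
```

The particular mixing matrix and sample size are arbitrary choices; any non-degenerate data would do.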
If, in addition, all the Oi are positive, the correlation matrix P exists and is identical with the unit matrix /, so that P== 1. Moreover, it is otihf in the uncorre- lated case that we have ^ = A^ . . . Ann and P —1. 22.4. Characteristic functions. — The c. f. of the }i dimensional random variable == (§i, . . . , 5„) is a function of the vector t = (if,, . . ., f„), defined by the mean value (cf 10.6) g»(t) = = fe*t'*dP, K where, in accordance with (11.2.1), — -f • + The pro- perties of the c. f. of a two-dimensional variable (cf 21.3) directly extend themselves to the case of a general n. In particular we have in the neighbourhood of t = 0 a development generalizing (21.3.2) (22.4.1) g) (f) = ^1 4 ^ /a + o tj If m = 0, this reduces to (22.4.2) gp (*) = 1 — i 2 " (.2 ^^) • The semi -invariants of a distribution in' n dimensions are defined by means of the expansion of log q) in the same way as in 15.10 for the case n — As in 21.3, it is shown that a necessary and sufiPicient condition for the independence of the variables x and y is that their joint c. f. is of the form gp (f , u) — gp, (t) q)^ (u). 296 22.4-5 The c. £. of the marg^inal distribution of any group of h variables picked out from . . . , is obtained from ^(t) by putting U = 0 for all the n — k remaining variables. Thus the joint c. f. of is (22.4.3) E (e' <'■ '=• + •■■ +'* i*)) = gp (<, 0). 22.5. Rank of a distribution. — The rank of a distribution in (Frisch, Ref. 113; cf also Lukomski, Ref. 151) will be defined as the common rank r of the moment matrix A and the correlation matrix P introduced in 22.3. The distribution will be called svigular or noh singular, according as < n or r~n. In the particular case n — 2, A is identical with the matrix M considered in 21.2. It was there shown that the rank of M is directly connected with certain linear degeneration properties of the distribu- tion. We shall now prove that a similar connection exists in the case of a general n. 
A distribution in Rₙ is non-singular when and only when there is no hyperplane in Rₙ that contains the total mass of the distribution. In order that a distribution in Rₙ should be of rank r, where r < n, it is necessary and sufficient that the total mass of the distribution should belong to a linear set L_r of r dimensions, but not to any linear set of less than r dimensions. Obviously it is sufficient to prove the second part of this theorem, since the first part then follows as a corollary. We recall that, by 3.4, a linear set of r dimensions in Rₙ is defined by n − r independent linear relations between the coordinates. Suppose first that we are given a distribution of rank r < n. The quadratic form of matrix Λ

(22.5.1)   Q(t) = Σᵢ,ₖ λᵢₖ tᵢ tₖ = E(Σᵢ tᵢ(ξᵢ − mᵢ))²

is then of rank r, and accordingly (cf 11.10) there are exactly n − r linearly independent vectors t_p = (t_{p1}, …, t_{pn}) such that Q(t_p) = 0. For each vector t_p, (22.5.1) shows that the relation

(22.5.2)   Σᵢ t_{pi}(ξᵢ − mᵢ) = 0

must be satisfied with the probability 1. The n − r relations corresponding to the n − r vectors t_p then determine a linear set L_r containing the total mass of the distribution, and since any vector t such that Q(t) = 0 must be a linear combination of the t_p, there can be no linear set of lower dimensionality with the same property. Conversely, if it is known that the total mass of the distribution belongs to a linear set L_r, but not to any linear set of lower dimensionality, it is in the first place obvious that L_r passes through the centre of gravity m, so that each of the n − r independent relations that define L_r must be of the form (22.5.2). The corresponding set of coefficients then by (22.5.1) defines a vector t_p such that Q(t_p) = 0, and since there are exactly n − r independent relations of this kind, Q(t) is by 11.10 of rank r, and our theorem is proved.
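The theorem can be illustrated by a small simulation (added here; it presupposes the Python numpy library, which is of course not part of the text). If the total mass of a distribution in R₃ is confined to a plane through the origin, the moment matrix must have rank 2, and its null vector recovers the linear relation (22.5.2):

```python
import numpy as np

# Distribution in R_3 whose total mass lies in the plane xi_1 + xi_2 - xi_3 = 0:
# take xi_3 = xi_1 + xi_2, so one linear relation holds with probability 1.
rng = np.random.default_rng(0)
x12 = rng.standard_normal((10_000, 2))
x = np.column_stack([x12, x12.sum(axis=1)])
x -= x.mean(axis=0)
Lam = (x.T @ x) / len(x)                     # moment matrix Lambda
assert np.linalg.matrix_rank(Lam, tol=1e-8) == 2
# the null vector of Lambda recovers the linear relation (22.5.2)
t = np.linalg.svd(Lam)[2][-1]                # a vector with Q(t) = 0
assert np.allclose(x @ t, 0, atol=1e-8)
```

With two mass-carrying coordinates and one exact linear relation, exactly n − r = 1 independent relation holds, as the theorem requires.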
Thus for a distribution of rank r < n, there are exactly n — r independent linear relations between the variables that are satisfied with a probability equal to one. As an example we may consider the case w = 3. A singular distribution in is of rank 2, 1 or 0, ac- cording as the total mass is confined to a plane, a straight line or a l)oint, and accordingly there are 1, 2 or 3 independent linear relations between the variables that are satisfied with a probability equal to one. 22.6. Linear transformation of variables. — Let f ^ , . . . , be random variables with a given distribution in Jl„, such that m = 0. Consider a linear transformation n (22.G.1) ^ Ciu (*• = 1 , 2, . . . , m). k-l with the matrix C— Cm,i — {c,jk}, where /?/ is not necessarily equal to u. In matrix notation (cf 11.3), the transformation (22.b.l) is simply y = Cx. This transformation defines a new random variable J • • • > with an m-dimensional distribution uniquely defined by the given dimensional distribution of x (cf 14.5 and 22.2). Obviously every r]i has the mean value zero. Writing A/ — JE (5, ^a), = E(r]i7]k), we further obtain from (22.6.1) n k ^ Cl r Ckn • r,8=l This holds even when m ^ 0, and shows that the moment matrices A = A„n={^ik} and M = Mmm -=■ satisfy the relation (22.6.2) M=CAC\ If, in the c. f. <p{t) of the variable x, we replace . . . , fn by new 298 22.6 variables u^, . . .^Um by means of the contragfredient transformation (cf 11.7.5) t=C'u, we have by (11.7.6) t* x — u y, and thus {22.6.3) (p[t) = E{o^'») = Eiei^'y) = i^(w), where 'tpiu) = . . ., Um) is the c. f. of the new variable y. From (22.6.2) we infer, by means of the properties of the rank of a product matrix (cf 11.6), that the rank of the y distribution never exceeds the rank of the x distribution. Consider now the particular case m = n, and suppose that the transformation matrix C = Cnn is non-singular. 
Then by 11.6 the matrices A and M have the same rank, so that in this case the transformation (22.6.1) does not affect the rank of the distribution. — Let us, in particular, choose for C an orthogonal matrix such that the transformed matrix M is a diagonal matrix (cf 11.9). This im- plies fiiic = 0 for i 7 ^ k, so that . . . , are uncorrelated variables {cf the discussion of the case n == 2 in 21.8). In this case, the reci- procal matrix exists (cf 11.7), and the reciprocal transformation x= C~^y shows that the may be expressed as linear functions of the t],. If the x-diatribution is of rank r, the diagonal matrix M contains exactly r positive diagonal elements, while all other elements of M are zeros. If r < w, we can always suppose the so arranged that the positive elements are jUn, . . . , firr. For t = r + 1, . . . , w, we then have fia ~ E{t]^) = 0, which shows that rji is almost always equal to zero. Thus we have the following generalization of 21.8: If the distribution of n variables fi, . . . , of rank 7', the may with a probability equal to 1 be expressed as linear functions of r un- correlated variables . . ., The concept of convergence in probability (cf 20.3) immediately ex- tends itself to multi dimensional variables. A variable jt = (5i, . . Sn) is said to converge in probability to the constant vector a = (aj, . . ., Ow) if converges in probability to ai for ? = 1, . . ., t?. We shall require the following analogue of the convergence theorem of 20.6, which may be proved by a straightforward generalization of the proof for the one-dimensional case: Suppose that we have for every r = 1, 2, . . . . y^ = Axv-^x^, where yv and Xr are n dimensional random variables^ while A is a matrix of order n • n with constant elements. Suppose further that, as 299 22.6-7 the n-dimenmonal distt'ihution of Xv tends to a certain limiting distribution, while converges in probability to zero. 
Then has the limiting distribution defined by the linear transformation y = Ax, where X has the limiting distribution of the Xy. 22.7. The ellipsoid of concentration. — The definition of the ellipse of concentration given in 21.10 may be generalized to any number of dimensions. Let the variables . . . , have a non singular distribu^ tion in Rn with m == 0 and the second order central moments likr and consider the non negative quadratic form Kk If a mass unit is uniformly distributed (i. e. such that the fr. f. is constant) over the domain bounded by the n-dimensional ellipsoid q = c*, the first order moments of this distribution will evidently be zero, while the second order moments are according to (11.12.4) An n 2 ’ A («, A = 1, 2, . . . , n). It is now required to determine c and the aa- such that these mo- ments coincide with the given moments It is readily seen that this is effected by choosing, in generalization of 21.10, c* = w + 2^ and Aki Aik Thus the ellipsoid (22.7.1) 3(5., . . . , 5„) = + 2 i,k has the required property. This will be called the ellipsoid of con- centration corresponding to the given distribution, and will serve as a geometrical illustration of the mode of concentration of the distribu- tion about the origin. The modification of the definition to be made in the case of a general m is obvious. - When two distributions with the same centre of gravity are such that one of the concentration ellipsoids lies wholly within the other, the former distribution will be said to have a greater concentration than the latter. The quadratic form q appearing in (22.7.1) is the reciprocal of the form 300 22.7-23J i,k (Since ^ is a symmetric matrix, we may replace by in the elements of the reciprocal matrix as defined in 11.7.) The n-dimensional volume of the ellipsoid (22.7.1) has by (11.12.3) the expression n n n n [n 4- 2)2 7C^ , (n + 2Y rc^ . " ■ . r 7 ~ / + ll / i (2^0 . . 
CTrtV /^ where the determinants -^ = |A*jt| and P==|ptjt| are both positive, since the distribution is non-singular. When ...,aw are given, it follows from (22.3.4) that the volume reaches its maximum when the variables are uncorrelated (P = 1), while on the other hand the volume tends to zero when the pa* tend to the correlation coefficients of a singular distribution. The ratio between the volume and its maximum value is equal to VP \ this quantity has been called the scatter coefficient of the distribution (Frisch, Ref. 113). It may be regarded as a measure of the degree of » non- singularity » of the distribution. — For « = 2, we have V~P = 1^1 — p*. On the other hand, the square of the volume of the ellipsoid is proportional to the determinant A = a\ . . . aj P, and this expression has been called the generalized variance of the distribution (Wilks, Ref. 232). For n = \ , A reduces to the ordinary variance a*, and for )i ~ 2 we have A ~ a\a\{\ — p*). We finally remark that the identity between the homothetic families generated by the ellipses of concentration and of inertia, which has been pointed out in 21.10 for the two-dimensional case, breaks down for n > 2. CHAPTER 23. Regression and Correlation in n Variables. 23.1. Regression surfaces. — The regression curves introduced in 21.5 may be generalized to any number of variables, Ji^hen the distri- bution belongs to one of the two simple types. Consider e. g. n vari- ables ?i, . . with a distribution of the continuous type. The cow- 301 23.1-2 ditional mean value of relative to the hypothesis = Xi for t = 2, . . . , w, is OO Xn)dx^ = a;*, . . = x„) = mi (a;*, . . x„) = ^ jf{xi, . . a:„)da:i The locus of the point (mj, XiiS for all possible values of Xg, . . x„ is the regression surface for the mean of and has the equation X, = mj (Xg, . . Xn), which is a straightforward generalization of (21.5.2). 23.2. Linear mean square regression. 
— We now consider n vari- ables ■ •.,?/» with a perfectly general distribution, such that the second order moments are finite. In order to simplify the writing, we shall further in this chapter always suppose m = 0. The formulae corresponding to an arbitrary centre of gravity will then be obtained simply by substituting — mt for in the relations given below. The mean square regression plane for with respect to will be defined as that hyperplane (23.2.1) ?i=^12-34 . n?2 + ^13-24 . » ^8 + • • + w • 23 which gives the closest fit to the mass in the /^ dimensional distribution in the sense that the mean value (23.2.2) — /^12 ■ 34 . . . n — i^ln ■ 23 . . . ?n)* is as small as possible. Thus the expression on the right hand side of (23.2.1) is the best linear estimate of in terms of . . ., §n, in the sense of minimizing (23,2.2), We may here regard ^g, . . as in- dependent variables, and as a dependent variable which is approxi- mately represented, or estimated, by a linear combination of the in- dependent variables. In a similar way we define the m. sq. regression plane for any other variable in which case of course takes the place of the dependent variable, while all the remaining variables . . ., ^t-i) ^/ + it • • •} a*re regarded as independent. For the regression coefficients^) /?, we have here used a notation Often also called partial regression coefficients. 302 23.2 introduced by Yule (Ref. 251). Each has two primary subscripts followed by a point, and then n — 2 secondary subscnpts. The first of the primary subscripts refers to the dependent variable, and the second to that independent variable to which the coefficient is at- tached. Thus the order of the two primary subscripts is essential. The secondary subscripts indicate, in an arbitrary order, the remaining* independent variables. — Sometimes, when no misunderstanding seems possible, we may omit the secondary subscripts. 
In order to determine the regression coefficients, we difiFerentiate the expression (23.2.2) with respect to each of the n — 1 unknown coefficients and then obtain the n — 1 equations + As + ■■ + k'ln^ln — ^21 > ^S 2 As ^88 As ‘ ■ * "i" ^3 n A “ ^81 kn 2 As + ^n 3 As 'i" ' ■ “1“ kji n A w kfi 1 , where we have omitted the secondary subscripts, thus writing (iik instead of the complete expression it . 23 . . . jt-i, jt+i . . n. The deter- minant of this system of equations is the cofactor of A,i in the determinant A = \kik |. Let us first suppose that the jr distribution is non singular (cf 22.5). The moment matrix A and the correlation matrix P are then definite positive, so that -^11 ^ 0 , and by (11.B.2) our equations have a unique solution (23.2.3) Oi Pit Ok Pn * By simple permutation of indices we obtain the corresponding ex- pression (23.2.4) Pik Ok Pi i for the coefficient in the regression plane for $*. The omitted secondary subscripts are here, of course, aU the numbers 1,2,...,?? with the exception of i and i, while the Aik and Pa are cofactors in A and P. In a non-singular distribution^ the regression plane for each vaHable with respect to all the others is thus uniquely determined^ and the re- gression coefficients are given by (23.2.4). — In the particular case of n 303 23.2 uncorrelaied variables, it follows that all regression coefficients are eero^ since toe have An ^0 for i 7 ^ k. Suppose now that the ^-distribution is singular, with a rank r < n. We then may hare Aii = Oy and accordingly some regression coef- ficients may be infinite or undetermined. As an example, we may con- sider the case n = 3. For a distribution of rank 2, the total mass is situated in a certain plane. As long as this plane is not parallel to one of the axes, it is then obvious that all three regression planes will coincide with this plane, so that all regression coefficients are finite and uniquely determined. 
Tf, on the other hand, the plane is parallel to one of the axes, e. g. the axis of ^|, the two-dimensional marginal distribution of total mass confined to a straight line. Now the moment matrix of this marginal distribu- tion has the determinant A^^ and thus we have Aii = 0. In this case, we may say that the regression plane for is parallel to the axis of ^ 1 , so that at least one of the regression coefficients /7i2 . s and |5 i 3.2 is infinite. — For a distribution of rank 1 or 0, on the other hand, the total mass belongs to a certain straight line or to a certain point. Each regression plane must then contain this line or point, but is otherwise undetermined. As in 21.6, we can show that the m.sq. regression plane (23.2.1) is also the plane of closest fit to the regression surface Xi = Wj (x^, . . ., Xn), for all distributions such that the latter exists. If it is known that the regression surface is a plane, this plane must thus be identical with the m. sq. regression plane. Consider next a group of any number h < n ot the variables say Sjj * • -1 The /i-dimensional marginal distribution of these vari- ables has a moment matrix which is a certain submatrix A* of A, We can then form the regression plane of with respect to ...,?<?» and the regression coefficients will be given by expressions analogous to (23.2.4), where An and Aik are replaced by the corresponding co- factors from the determinant A'^ = \A*\. — If, in particular, we consider the group of the n — 1 variables . . ., §;-i, we obtain (23.2.5) = Ajj.ii where the omitted secondary subscripts are the numbers 1, 2, . . ., w, with the exception of i, j and k, while Ajj.ik is the cofactor of I/jt in Ajj (cf 11.5.3). 304 23.3-4 23.3. Residuals. — Suppose -4,, 0. The difference (23.3.1) 171.23 = ^ln?w, where the regression coefficients /?jit are given by (23.2.3), may be considered as that part of the variable 5 ,, which remains after sub- traction of the best linear estimate of in terms of § 3 , . . 
., This is known as the residual of with respect to . . ., ^n. The residual is uncorrelated ivith any of the » subtracted i^ariahles We have, in fact, introducing the expressions of the /9:s, (23.3.2) 171-23 . . w — - j - 2 -^1 ^ ^11 Hence E ( 171 . 23 . >») = 0, and 1 (23.3.3) E(^, 171 . 03 . „) = ' — for/=l, ^11 0 for i — 2, 3, . . ., 17. It follows that the residual variance ai.^a .n — -^( 171 . 23. n) is given by ‘ k-i (23.3.4) a\ „ - E(t“x »?i ■ aa ...«) = “= o'? ^ ■ and further that the two residuals 171 . 23 . ,n and , (j are nncor- related, provided that all subscripts z, y, . . q of the latter occur among the secondary subscripts of the former. The residual variance al. 2 . 3 ...n may, of course, be regarded as a measure of the greatest closeness of fit that may be obtained when we try to represent by a linear combination of . . ., — In the case w = 2, the expression (23.3.4) reduces to a ?.2 = ( 7 ?(l — p*), in accordance with (21.7.2). 23.4. Partial correlation. — The correlation between the variables and ^2 is measured by the correlation coefficient which is some- times also called the total cor7'€latio77 coefiicient of and fg. If and S 2 4ire considered in conjunction with n — 2 further variables fj, . . ., we may, however, regard the variation of and fg as to a certain extent due to the variation of these other variables. Now the residuals 171. 34 . .w and 172-34 71 represent, according to the preceding paragraph, those parts of the variables and respectively, which remain after subtraction of the best linear estimates in terms of ^g, . . ., ^n- Thus we may regard the correlation coefficient between these two resi- 305 20-464. H.Cramir 23.4 duals as a measure of the correlation between and §2 ^fter re- moval of any part of the variation due to the influence of I., • . ?n. This will be called the partial correlation coefficient of and §2, with respect to gj, . . and will be denoted by ^12 -34. . 
. . . n. Here the order of the subscripts is, of course, immaterial for primary as well as for secondary subscripts. — We thus have

(23.4.1)   ρ₁₂·₃₄…ₙ = E(η₁·₃₄…ₙ η₂·₃₄…ₙ) / √(E(η²₁·₃₄…ₙ) E(η²₂·₃₄…ₙ)).

This expression being an ordinary correlation coefficient between two random variables, we must have −1 ≤ ρ₁₂·₃₄…ₙ ≤ 1. The residuals η₁·₃₄…ₙ and η₂·₃₄…ₙ may be expressed in a form analogous to (23.3.2), if we make use of the expression (23.2.5) for the regression coefficients in a group of n − 1 variables. We then obtain the two following relations analogous to (23.3.4):

E(η²₁·₃₄…ₙ) = E(ξ₁ η₁·₃₄…ₙ) = Λ₂₂/Λ₁₁,₂₂,
E(η²₂·₃₄…ₙ) = E(ξ₂ η₂·₃₄…ₙ) = Λ₁₁/Λ₁₁,₂₂,

and further

E(η₁·₃₄…ₙ η₂·₃₄…ₙ) = E(ξ₁ η₂·₃₄…ₙ) = −Λ₁₂/Λ₁₁,₂₂.

Inserting these expressions in (23.4.1) we obtain the simple formula

(23.4.2)   ρ₁₂·₃₄…ₙ = −Λ₁₂/√(Λ₁₁ Λ₂₂) = −P₁₂/√(P₁₁ P₂₂).

By index permutation we obtain an analogous expression for the partial correlation coefficient of any two variables ξᵢ and ξₖ with respect to the n − 2 remaining variables. It is thus seen that any partial correlation coefficient may be expressed in terms of the central moments λᵢₖ, or the total correlation coefficients ρᵢₖ, of the variables concerned. Thus we obtain, e.g., in the case n = 3

(23.4.3)   ρ₁₂·₃ = (ρ₁₂ − ρ₁₃ ρ₂₃) / √((1 − ρ₁₃²)(1 − ρ₂₃²)).

In the particular case of n uncorrelated variables, it follows from (23.4.2) that all partial correlation coefficients are, like the corresponding total correlation coefficients, equal to zero. We thus have, e.g., ρ₁₂·₃₄…ₙ = ρ₁₂ = 0. As soon as there is correlation between the variables, however, ρ₁₂·₃₄…ₙ is in general different from ρ₁₂. It is, e.g., easily seen from (23.4.3) that ρ₁₂ and ρ₁₂·₃ may have different signs, and that either of these coefficients may be equal to zero, while the other is different from zero.
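The explicit formula (23.4.3), and the cofactor form (23.4.2) from which it derives, can be checked against the definition of the partial correlation coefficient as the ordinary correlation of two residuals (a simulation sketch added here, assuming the Python numpy library):

```python
import numpy as np

# Check of (23.4.3): rho_{12.3} = (rho_12 - rho_13 rho_23) /
# sqrt((1 - rho_13^2)(1 - rho_23^2)), against the correlation of the
# residuals of xi_1 and xi_2 after regression on xi_3.
rng = np.random.default_rng(2)
z = rng.standard_normal(50_000)                    # common factor -> correlation
x1 = z + 0.5 * rng.standard_normal(50_000)
x2 = z + 0.5 * rng.standard_normal(50_000)
x3 = z + 0.5 * rng.standard_normal(50_000)
X = np.column_stack([x1, x2, x3])
X -= X.mean(axis=0)
P = np.corrcoef(X, rowvar=False)
formula = (P[0, 1] - P[0, 2] * P[1, 2]) / np.sqrt((1 - P[0, 2]**2) * (1 - P[1, 2]**2))
# residuals after subtracting the best linear estimate in terms of xi_3
b13 = (X[:, 0] @ X[:, 2]) / (X[:, 2] @ X[:, 2])
b23 = (X[:, 1] @ X[:, 2]) / (X[:, 2] @ X[:, 2])
direct = np.corrcoef(X[:, 0] - b13 * X[:, 2], X[:, 1] - b23 * X[:, 2])[0, 1]
assert abs(formula - direct) < 1e-8
# cofactor form (23.4.2): -P_12 / sqrt(P_11 P_22), read off from P^{-1}
Pinv = np.linalg.inv(P)
assert abs(formula - (-Pinv[0, 1] / np.sqrt(Pinv[0, 0] * Pinv[1, 1]))) < 1e-8
```

Both identities hold exactly for the sample moments, so the agreement is limited only by rounding, not by the sample size.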
When all total correlation coefficients Qik are known, the partial correlation coefficients may be directly calculated from (23.4.2) and the analogous explicit expressions obtained by index permutation. The numerical calculations may be simplified by the use of certain recurrence relations, such as (23.4.4) (>12 • 34 . £12 ,34 . w-l pi n • 34 . w-1 p i 7> ■ 34 . . . 7I~1 ^^(l — Pl«*34. 7J-l)(l — P2ri.-34. . n-l) (cf Ex. 11, p. 319), which shows an obvious analogy to (23.4.3). By this relation, any partial correlation coefficient may be expressed in terms of similar coefficients, where the number of secondary subscripts is reduced by one. Starting from the total coefficients pi it, we may thus first calculate all partial coefficients Qij.k with one secondary subscript, then the coefficients Qij^ki with two secondary subscripts, etc. Farther, when the total and partial correlation coefficients are known, any desired residual variances and partial regression coefficients may be calculated by means of the relations (cf Ex. 12 — 13, p. 319) ■ 23 . 71 = ( 1 — Pll) f 1 Pl3 • 2 ) ( 1 (23.4.5) ^12 -34 . . 71 “ P 12 • 34 £14 - 23 ) ... (1 — pi 71 *23 7i“l), Oi . 34 . 71 I “■ ------ 5 (72 * 34 . 7 » and the analogous relations obtained by index permutation. It will be seen that these relations are direct generalizations of (21.6.9) and (21.6.10). — From the last relation we obtain (23.4.6) Pl2 -34 . . . 71 = /?12'34 . . 7>. 23.5. The multiple correlation coefficient. — Consider the residual defined by (23.3.1) 'fJl Ti . 71 ” fii2 §8 ‘ 77 ” ^*1 ?I , where = Aa ?8 + ' best linear estimate of in terms of ^'27 • • -7 ^ 71 * It is easily shown that, among all linear combinations 307 23.5 of ^2, . . it is that has the maximum correlation with as measured by the ordinary correlation coefficient. The correlation coef- ficient of the variables and may thus be regarded as a measure of the correlation between on the one side, and the tolality of all variables on the other. 
We shall call this the multiple cor- relation coefficient between i', and • • •, 5/i), and write (23.5.1) if I (23 . . . u) By (23.3.3) and instead of r]i . 03 (23.3.4) we have, however, writing yin ’ for simplicity and thus (23.5.2) i ’ By (11.10.2) we have .so that E(S^^l) -0, and 0 ~ (fi li.f fi) 1 . When (^iic23 >^1— 1, the variable is ^almost certainly eciual to a linear combination of ^'2, . . , This means that the total mass of the joint distribution of all n variables is confined to a certain hyper- plane in R,,^ so that the distribution is singular, and we have A — P — 0, in accordance with (23.5.2). On the other hand, for a non-singular distribution it follows from the develo])ment (11.5.3) that we have UJ I'J.s 2 • 1 1 (0 /(?!/., where the sum in the second member is, by 11.10, a definite positive ijuadratic form in the variables pjg, - > Thus (^>1(23 . — 0 when and only when Pia = ~pin -=0 i- c. wlien is uncorrelated with every for ? — 2, 3, . For the numerical calculation, it is convenient to use the relation (cf Ex. 13, p. 319) (2:j.5.3) 1‘2:« n> — ui . aos 23.6 23.6. Orthogonal mean square regression. ~ The orthogonal m. sq. regression line introduced in 21.6 may be generalized to any number of variables. A hyper- plane H passing through the centre of gravity m = 0 of our w dimensional distribu- tion has the equation ^1 + fiz ^2 ^ fin Sn “ nbcre /?].•• -t fin denote the generalized direction cosines of the normal to the plane, so that S — 1* The square of the distance between If and the point x ~ (|j. . . ., is rZ* - us try to find If such that the mean value Eld^) becomes as small us possible. If such a hyperplane If exists, it will be called an orthogonal m.nq. regression plane of the distribution (cf K. Pearson, Ref. 183 a). 
For a distribution of rank less than n, the problem is trivial, since the whole mass belongs to a certain hyperplane H, which must then yield the value E(d²) = 0. We may thus suppose that the distribution is non-singular, which by 11.9 implies that the characteristic numbers of the moment matrix Λ are all positive. Let λ₀ denote the smallest of the characteristic numbers, and let α₁, …, α_n be a solution of the homogeneous system

(\lambda_{11} - \lambda_0)\,\alpha_1 + \lambda_{12}\,\alpha_2 + \cdots + \lambda_{1n}\,\alpha_n = 0,
\lambda_{21}\,\alpha_1 + (\lambda_{22} - \lambda_0)\,\alpha_2 + \cdots + \lambda_{2n}\,\alpha_n = 0,
\cdots\cdots
\lambda_{n1}\,\alpha_1 + \lambda_{n2}\,\alpha_2 + \cdots + (\lambda_{nn} - \lambda_0)\,\alpha_n = 0,

where the α_i are not all equal to zero. By 11.8, such a solution certainly exists, since |Λ − λ₀I| = 0. Further, we may obviously suppose Σα_i² = 1. Then the hyperplane H₀ with the equation Σα_iξ_i = 0 has the required properties. Let, in fact, d₀ denote the distance from the point x to H₀, while d is the distance to any other hyperplane Σβ_iξ_i = 0. We then have, writing z_i = β_i − α_i and bearing in mind that Σα_i² = Σβ_i² = 1, so that 2Σα_i z_i = −Σz_i²,

E(d^2) = \sum_{i,k}\lambda_{ik}\beta_i\beta_k = \sum_{i,k}\lambda_{ik}\alpha_i\alpha_k + 2\sum_{i,k}\lambda_{ik}\alpha_i z_k + \sum_{i,k}\lambda_{ik}z_i z_k = E(d_0^2) + \sum_{i,k}(\lambda_{ik} - \lambda_0\varepsilon_{ik})\,z_i z_k,

where the ε_{ik} are the elements of the unit matrix I. Since λ₀ is the smallest characteristic number of Λ, the matrix Λ − λ₀I is by 11.10 non-negative, and thus we have E(d²) ≥ E(d₀²). It can further be shown that, if λ₀ is a simple root of the secular equation of Λ, the orthogonal regression plane H₀ found in this way is unique, whereas if λ₀ is a multiple root, there are an infinity of planes with the required properties. These results become intuitive, if we remember that, by (11.9.6), the reciprocal matrix Λ⁻¹ has the characteristic numbers 1/λ_i, so that the squares of the principal axes of the concentration ellipsoid (22.7.1) are proportional to the numbers λ_i. This shows that the orthogonal m. sq. regression plane is orthogonal to the smallest axis of the concentration ellipsoid, and is thus determinate or indeterminate, according as this smallest axis is unique or not.
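The argument can be sketched numerically in two dimensions (the moment matrix is invented): the unit normal built from the characteristic vector belonging to the smallest characteristic number λ₀ attains E(d²) = λ₀, and no other choice of unit normal does better:

```python
from math import sqrt

l11, l12, l22 = 2.0, 0.6, 1.0        # invented moment matrix (l12 != 0)

# Characteristic numbers of [[l11, l12], [l12, l22]]; lam0 is the smallest.
s, d = l11 + l22, sqrt((l11 - l22) ** 2 + 4 * l12 ** 2)
lam0 = (s - d) / 2

def E_d2(b1, b2):
    # E[(b1*xi1 + b2*xi2)^2] for a unit normal vector (b1, b2).
    return b1 * b1 * l11 + 2 * b1 * b2 * l12 + b2 * b2 * l22

# A solution of the homogeneous system: (l11 - lam0)*a1 + l12*a2 = 0.
a1, a2 = l12, -(l11 - lam0)
norm = sqrt(a1 * a1 + a2 * a2)
a1, a2 = a1 / norm, a2 / norm      # normalize so a1^2 + a2^2 = 1

print(abs(E_d2(a1, a2) - lam0) < 1e-12, E_d2(a1, a2) <= E_d2(1.0, 0.0))
```

Comparing with the coordinate direction (1, 0) illustrates the minimizing property; in n dimensions the same computation runs over the full homogeneous system displayed above.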
We can also define a straight line L of closest fit to the distribution, by the condition that E(d²) should be a minimum, where d denotes the shortest distance between L and a point x. It can be shown that this line coincides with the largest axis of the concentration ellipsoid.

CHAPTER 24.

The Normal Distribution.

24.1. The characteristic function. — As in the two-variable case (21.11 and 21.12), we introduce first the c. f. of the normal distribution. Let

Q(t) = Q(t_1, \ldots, t_n) = \sum_{j,k=1}^{n} \lambda_{jk}\, t_j t_k

denote a non-negative quadratic form in t₁, …, t_n, while m = (m₁, …, m_n) is a real vector. We shall then show that the function

(24.1.1) \varphi(t) = \varphi(t_1, \ldots, t_n) = e^{\,i\sum_j m_j t_j \, - \, \frac{1}{2}Q(t)}

is the c. f. of a certain distribution in R_n. This distribution will be called a normal distribution. Before proceeding to the proof of this statement, which will be given in the two following paragraphs, we shall make some introductory remarks. — In matrix notation (cf 11.2 and 11.4), the expression (24.1.1) of the c. f. may be written

(24.1.2) \varphi(t) = e^{\,i\,m't \, - \, \frac{1}{2}\,t'\Lambda t}.

The development (22.4.1) shows that the quantities m_j and λ_{jk} have here their usual signification as mean values and second order central moments. By (22.4.3), it further follows that any marginal distribution of a normal distribution is itself normal. If the moment matrix Λ = {λ_{jk}} is a diagonal matrix, the c. f. (24.1.1) breaks up into a product φ₁(t₁) ⋯ φ_n(t_n), where each factor is the c. f. of a one-dimensional normal distribution. Thus n uncorrelated and normally distributed variables are always independent. As in the two-variable case, we shall have to distinguish two cases, according as the non-negative form Q is definite or semi-definite. Obviously we may suppose throughout that m = 0, since this only involves the addition of a constant vector to the variable x = (ξ₁, …, ξ_n). We use the same notations for moments, correlation coefficients etc. as in the preceding chapters.

24.2. The non-singular normal distribution.
— If the quadratic form Q is definite positive, the reciprocal form Q⁻¹ exists, and we have (cf 11.7)

Q(t) = \sum_{j,k} \lambda_{jk}\, t_j t_k, \qquad Q^{-1}(x) = Q^{-1}(x_1, \ldots, x_n) = \frac{1}{\Lambda}\sum_{j,k} \Lambda_{jk}\, x_j x_k.

(Since the moment matrix Λ is symmetric, we are entitled to write Λ_{jk} instead of Λ_{kj}.) By (11.12.1 b) we then have

\int_{R_n} e^{\,i\,t'x \, - \, \frac{1}{2}Q^{-1}(x)}\, dx = (2\pi)^{n/2}\,\Lambda^{1/2}\, e^{-\frac{1}{2}Q(t)}.

This shows that the function

(24.2.1) f(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\Lambda}}\; e^{-\frac{1}{2}Q^{-1}(x)} = \frac{1}{(2\pi)^{n/2}\,\sigma_1\cdots\sigma_n\sqrt{P}}\; e^{-\frac{1}{2\Lambda}\sum_{j,k}\Lambda_{jk}x_j x_k}

is a probability density in R_n, with the c. f.

(24.2.2) \varphi(t) = e^{-\frac{1}{2}\,t'\Lambda t}.

Substituting x_j − m_j for x_j in (24.2.1), we obtain the fr. f. of the general non-singular normal distribution in R_n, the c. f. of which is given by (24.1.1). For this distribution, the family of homothetic ellipsoids

\frac{1}{\Lambda}\sum_{j,k}\Lambda_{jk}(x_j - m_j)(x_k - m_k) = c^2,

generated by the concentration ellipsoid (22.7.1), are equiprobability surfaces, the fr. f. being on one of these surfaces proportional to e^{-c^2/2} (cf Ex. 15, p. 319).

24.3. The singular normal distribution. — When the non-negative form Q is semi-definite, no reciprocal form exists, and the expression (24.2.1) for the fr. f. becomes indeterminate. As in the two-dimensional case (cf 21.12) we find, however, that the function φ(t) = e^{−½Q(t)} may be represented as the limit of a sequence of functions of the same type, but with definite forms Q_r. (We may, e. g., take Q_r = Q + (1/r)Σ_j t_j², where r = 1, 2, ….) By the continuity theorem of 10.7, it then follows that the corresponding non-singular normal distributions tend to a limiting distribution, and that φ(t) is the c. f. of this limiting distribution, which will be called a singular normal distribution. If the rank of the semi-definite form Q is denoted by r, we have r < n, and the moment matrix Λ of the variables ξ₁, …, ξ_n has the same rank r. It then follows from 22.5 that the total mass of the distribution is confined to a certain linear set L_r of r dimensions. Further, by 22.6 the variables ξ₁, …, ξ_n may with a probability equal to 1 be expressed as linear functions of r uncorrelated variables η₁, …, η_r, which are themselves linear functions of the ξ_i. Now it will be shown in the following paragraph that any linear functions of normally distributed variables are themselves normally distributed, and by 24.1 we know that uncorrelated normally distributed variables are always independent. Hence we deduce the following theorem: If the n variables ξ₁, …, ξ_n are distributed in a normal distribution of rank r, they can with a probability equal to 1 be expressed as linear functions of r independent and normally distributed variables. — Obviously this theorem holds true also for r = n.

24.4. Linear transformation of normally distributed variables. — The expressions normal distribution and normally distributed variables will in the sequel always be understood so as to include singular as well as non-singular distributions. Let the variable x = (ξ₁, …, ξ_n) have a normal distribution in R_n, such that m = 0. By the linear transformation (22.6.1), we introduce a new variable y = (η₁, …, η_m), where m is not necessarily equal to n. In matrix notation we then have y = Cx, where C = C_{mn}. Between the moment matrices Λ and M of x and y, we have by (22.6.2) the relation M = CΛC', which holds even when m ≠ 0. We shall now try to find the c. f. of y. By (24.1.2), the c. f. of x is in matrix notation

\varphi(t) = E(e^{\,i\,t'x}) = e^{-\frac{1}{2}\,t'\Lambda t}.

If we replace here t by a new variable u by means of the contragredient substitution t = C'u, we obtain according to (22.6.3) the c. f. ψ(u) of y. We thus have

\psi(u) = E(e^{\,i\,u'y}) = e^{-\frac{1}{2}\,u'C\Lambda C'u} = e^{-\frac{1}{2}\,u'Mu}.

The last expression is, however, the c. f. of a normal distribution in R_m, with the moment matrix M. Thus any number of linear functions of normally distributed variables are themselves normally distributed. — The remark of 24.1 that any marginal distribution of a normal distribution is itself normal, is included as a particular case in this proposition.

24.5. Distribution of a sum of squares. — In 18.
1, we have studied the distribution of the sum χ² = Σ₁ⁿ ξ_i², where the ξ_i are independent and normal (0, 1). This is the χ²-distribution with n degrees of freedom, and the fr. f. of χ² is the function k_n(x) defined by (18.1.3). On a later occasion (cf 30.1–30.3), we shall require the distribution of Σ₁ⁿ ξ_i² in the more general case when ξ₁, …, ξ_n are normally distributed with zero means and a moment matrix Λ, the characteristic numbers (cf 11.9) of which are all equal to 0 or 1. Suppose that p of the characteristic numbers are 0, while the n − p others are 1. Then we may find an orthogonal transformation y = Cx replacing the old variables x = (ξ₁, …, ξ_n) by new variables y = (η₁, …, η_n), such that the transformed moment matrix M = CΛC' is a diagonal matrix with its n − p first diagonal elements equal to 1, while the p others are 0. This implies, however, that the new variables η₁, …, η_{n−p} are independent and normal (0, 1), while η_{n−p+1}, …, η_n have zero means and zero variances, and are thus with the probability 1 equal to zero. Hence we have with the probability 1

\sum_{1}^{n} \xi_i^2 = \sum_{1}^{n} \eta_i^2 = \sum_{1}^{n-p} \eta_i^2.

Thus Σ₁ⁿ ξ_i² is distributed as the sum of the squares of n − p independent variables that are normal (0, 1), i. e. Σ₁ⁿ ξ_i² has a χ²-distribution with n − p degrees of freedom, and the fr. f. k_{n−p}(x). We finally consider the still more general case of a sequence of variables x, x', …, such that the distribution of the general term x = (ξ₁, …, ξ_n) tends to a normal distribution of the type considered above. Applying 10.7 and the multi-dimensional form of (7.5.9) to the c. f. of Σ₁ⁿ ξ_i², it then follows that, in the limit, the sum of squares Σ₁ⁿ ξ_i² has a χ²-distribution with n − p degrees of freedom.

24.6. Conditional distributions. — Let ξ₁, …, ξ_n be n variables having a non-singular normal distribution with m = 0, the fr. f. of which is given by (24.2.1). The conditional fr. f.
of a certain number of the variables, when the remaining variables assume prescribed values, is given by an expression of the form (22.1.1), and it is easily seen that in the present case this is always a non-singular normal fr. f. We shall treat as examples the conditional distributions for one and two variables.

One variable. — The conditional fr. f. of ξ₁, relative to the hypothesis ξ_i = x_i for i = 2, …, n, is by (22.1.1)

f(x_1 \mid x_2, \ldots, x_n) = \frac{f(x_1, \ldots, x_n)}{\int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\, dx_1} = A\, e^{-\frac{1}{2\Lambda}\left(\Lambda_{11}x_1^2 \, + \, 2\sum_{j=2}^{n}\Lambda_{1j}x_1 x_j\right)} = B\, e^{-\frac{\Lambda_{11}}{2\Lambda}\left(x_1 \, + \, \sum_{j=2}^{n}\frac{\Lambda_{1j}}{\Lambda_{11}}\,x_j\right)^2},

where A and B are independent of x₁, but may depend on x₂, …, x_n. Now we know that the last expression is a fr. f. in x₁, and it follows that we must have B = \sqrt{\Lambda_{11}/(2\pi\Lambda)}, so that the conditional distribution of ξ₁ is a normal distribution with the variance Λ/Λ₁₁ and the mean

m_1(x_2, \ldots, x_n) = -\frac{\Lambda_{12}}{\Lambda_{11}}\,x_2 - \cdots - \frac{\Lambda_{1n}}{\Lambda_{11}}\,x_n = \beta_{12\cdot34\ldots n}\,x_2 + \cdots + \beta_{1n\cdot23\ldots(n-1)}\,x_n,

where the β:s are the regression coefficients given by (23.2.3). Thus the regression is linear, and accordingly (cf 23.2) we find that the regression surface for the mean of ξ₁ coincides with the m. sq. regression plane. We further observe that the conditional variance Λ/Λ₁₁ is independent of x₂, …, x_n, and is equal to the residual variance E(η²_{1·23…n}) as given by (23.3.4).

Two variables. — The conditional fr. f. of ξ₁ and ξ₂ is

f(x_1, x_2 \mid x_3, \ldots, x_n) = \frac{f(x_1, \ldots, x_n)}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f\, dx_1\, dx_2} = C\, e^{-\frac{1}{2\Lambda}\left(\Lambda_{11}x_1^2 + 2\Lambda_{12}x_1 x_2 + \Lambda_{22}x_2^2 + 2Dx_1 + 2Ex_2\right)},

where C, D and E are independent of x₁ and x₂. We now introduce three quantities s₁, s₂ and r defined by the expressions

s_1^2 = \frac{\Lambda_{22}\,\Lambda}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^2}, \qquad s_2^2 = \frac{\Lambda_{11}\,\Lambda}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^2}, \qquad r = -\frac{\Lambda_{12}}{\sqrt{\Lambda_{11}\Lambda_{22}}}.

We then obtain by (11.7.3)

\frac{1}{(1-r^2)\,s_1^2} = \frac{\Lambda_{11}}{\Lambda}, \qquad \frac{1}{(1-r^2)\,s_2^2} = \frac{\Lambda_{22}}{\Lambda}, \qquad \frac{r}{(1-r^2)\,s_1 s_2} = -\frac{\Lambda_{12}}{\Lambda},

so that

\frac{1}{\Lambda}\left(\Lambda_{11}x_1^2 + 2\Lambda_{12}x_1 x_2 + \Lambda_{22}x_2^2\right) = \frac{1}{1-r^2}\left(\frac{x_1^2}{s_1^2} - \frac{2r\,x_1 x_2}{s_1 s_2} + \frac{x_2^2}{s_2^2}\right).

Comparing this with the expression of the two-dimensional normal fr. f.
given in 21.12, we find that the conditional distribution of ξ₁ and ξ₂ is a non-singular normal distribution with the conditional variances s₁² and s₂², and the conditional correlation coefficient r. We observe that all these three quantities are independent of x₃, …, x_n. The variances are identical with the variances of the residuals η_{1·34…n} and η_{2·34…n} studied in 23.4, while the conditional correlation coefficient is identical with the correlation coefficient of these two residuals, or the partial correlation coefficient ρ_{12·34…n} as given by (23.4.2). For the normal distribution, the latter coefficient has thus the important property of showing not only the correlation between the residuals but, moreover, the correlation between ξ₁ and ξ₂ for any fixed values of ξ₃, …, ξ_n.

24.7. Addition of independent variables. The central limit theorem. — The sum of two n-dimensional random variables x = (ξ₁, …, ξ_n) and y = (η₁, …, η_n) is defined as in the two-dimensional case (cf 21.11), by writing x + y = (ξ₁ + η₁, …, ξ_n + η_n). As in 21.11, it is proved that the c. f. of a sum of independent variables is the product of the c. f.:s of the terms. The expression (24.1.1) for the c. f. of the normal distribution further immediately shows that the sum of any number of normally distributed and independent variables is itself normally distributed, as proved for the one-dimensional case in 17.3. In 21.11, we have considered a sum of a large number of independent two-dimensional variables, all having the same distribution. We have proved that, if the sum is divided by the square root of the number of terms, the distribution of this standardized sum tends to a certain normal distribution, as the number of terms tends to infinity. A straightforward generalization of the proof of this theorem shows that the theorem holds for variables in any number of dimensions.
— This is the generalization to n dimensions of the Lindeberg-Lévy theorem of 17.4, and thus forms the simplest case of the Central Limit Theorem for variables in R_n. The general form of this theorem asserts that, subject to certain conditions, the sum of a large number of independent n-dimensional random variables is asymptotically normally distributed. — The exact conditions for the validity of the theorem, in the general case when the terms may have unequal distributions, are rather complicated, and we shall not go further into the matter here. A fairly general statement will be found in Cramér, Ref. 11, p. 113.

Exercises to Chapters 21–24.

1. ξ and η are two variables with finite second order moments. Show that D²(ξ + η) = D²(ξ) + D²(η) when and only when the variables are uncorrelated.

2. Let φ₁(t), φ₂(t) and φ(t) denote the c. f.:s of ξ, η and ξ + η respectively. It has been shown in 16.12 that φ(t) = φ₁(t)φ₂(t) when ξ and η are independent. Conversely, if we know that φ(t) = φ₁(t)φ₂(t) for all t, does it follow that ξ and η are independent? — Consider the fr. f. f(x, y) = ¼[1 + xy(x² − y²)], (|x| ≤ 1, |y| ≤ 1), and show by means of this example that the answer is negative.

3. Consider the expansion (21.3.2) for the c. f. of a two-dimensional distribution. Show that, if the distribution has finite moments of all orders, this expansion may be extended to terms of any degree in t and u. Use this expansion to show that, for the normal distribution, any central moment of even order i + k = 2n is equal to the coefficient of tⁱuᵏ in the polynomial

\frac{i!\,k!}{2^n\, n!}\,\bigl(\mu_{20}t^2 + 2\mu_{11}t u + \mu_{02}u^2\bigr)^n.

4. The joint distribution of ξ and η is normal, with zero mean values and the correlation coefficient ρ. Show that the correlation coefficient of ξ² and η² is ρ².

5. Consider two variables ξ and η with a joint distribution of the continuous type, and let φ(t, u) denote the joint c. f. Using the notations of 21.4, we then have
\int_{-\infty}^{\infty} e^{itx}\, dx \int_{-\infty}^{\infty} y^k f(x, y)\, dy = \left[\frac{\partial^k \varphi(t, u)}{\partial (iu)^k}\right]_{u=0} = \int_{-\infty}^{\infty} e^{itx}\, E(\eta^k \mid \xi = x)\, f_1(x)\, dx.

Conversely, there is a reciprocal formula analogous to (10.3.2):

E(\eta^k \mid \xi = x)\, f_1(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\left[\frac{\partial^k \varphi(t, u)}{\partial (iu)^k}\right]_{u=0} dt,

if the last integral is absolutely convergent. Use this result to deduce the properties given in 21.12 of the conditional mean and the conditional variance of the normal distribution.

6. We use the same notations as in the preceding exercise, and suppose that η is never negative. If the integral

g(x) = \frac{1}{2\pi i}\int_{-\infty}^{\infty} \left[\frac{\partial \varphi(t, u)}{\partial u}\right]_{u=-tx} dt

is uniformly convergent with respect to x, it represents the fr. f. g(x) of the variable ξ/η. (Generalization of Cramér, Ref. 11, p. 46, who gives the proof for the particular case when ξ and η are independent.) Use this result to deduce the distributions of 18.2 and 18.3, and generalize Student's distribution to the case when the variable ξ in (18.2.1) is normal (m, σ), where m ≠ 0 (the »non-central» t-distribution).

7. Find the necessary and sufficient conditions that three given numbers ρ₁₂, ρ₁₃ and ρ₂₃ may be the correlation coefficients of some three-dimensional distribution. Find the possible values of c in the particular case when ρ₁₂ = ρ₁₃ = ρ₂₃ = c.

8. Each of the variables x, y and z has the mean 0 and the s. d. 1. The variables satisfy the relation ax + by + cz = 0. Find the moment matrix Λ and show that we must have a⁴ + b⁴ + c⁴ ≤ 2(a²b² + b²c² + c²a²).

9. A certain random experiment may produce any of n mutually exclusive events E₁, E₂, …, E_n, the probability of E_j being p_j > 0, where Σ₁ⁿ p_j = 1. In a series of N repetitions, E_j occurs ν_j times, where Σ₁ⁿ ν_j = N. Show that the probability of this result is

\frac{N!}{\nu_1! \cdots \nu_n!}\; p_1^{\nu_1} \cdots p_n^{\nu_n}.

The joint distribution of ν₁, …, ν_n defined by these probabilities is a generalization of the binomial distribution, known as the multinomial distribution. Show that for this distribution

m_j = E(\nu_j) = Np_j, \qquad \lambda_{jj} = E(\nu_j - Np_j)^2 = Np_j(1 - p_j), \qquad \lambda_{jk} = -Np_j p_k \quad (j \neq k).

Show that, for the moment matrix Λ, we have the determinant Λ = 0 and the cofactor Λ₁₁ = N^{n-1}\,p_1 p_2 \cdots
p_n ≠ 0, so that the rank of the distribution is n − 1, in accordance with the relation Σ₁ⁿ ν_j = N between the variables. Show that

\varrho_{12} = -\sqrt{\frac{p_1 p_2}{(1-p_1)(1-p_2)}},

and, for j = 3, …, n,

\varrho_{12\cdot34\ldots j} = -\sqrt{\frac{p_1 p_2}{(1 - p_1 - p_3 - \cdots - p_j)(1 - p_2 - p_3 - \cdots - p_j)}}.

Show further that the joint c. f. of the variables

x_j = \frac{\nu_j - Np_j}{\sqrt{Np_j}}

tends, as N → ∞, to the limit

e^{-\frac{1}{2}\left(\sum_1^n t_j^2 \, - \, \left(\sum_1^n t_j\sqrt{p_j}\right)^2\right)}.

This is the c. f. of a normal distribution in R_n. Show that this distribution is of rank n − 1, and that the variables satisfy the relation Σ₁ⁿ √p_j ξ_j = 0. Find ρ₁₂ and ρ_{12·34…n}.

10. Take in the multinomial distribution p_j = λ_j/N for j = 1, …, n − 1, and p_n = 1 − (λ₁ + ⋯ + λ_{n−1})/N. Investigate the limiting distribution as N → ∞ (multidimensional Poisson distribution).

11. Show that the residual η_{1·23…n} given by (23.3.1) may also be interpreted as the residual of the variable η_{1·23…(n−1)} with respect to the single variable η_{n·23…(n−1)}, and that, by means of this result, the formula (23.4.4) for the partial correlation coefficient may be deduced from (23.4.3).

12. Use the result of the preceding exercise to prove the relation

D^2(\eta_{1\cdot23\ldots n}) = D^2(\eta_{1\cdot23\ldots(n-1)})\,\bigl(1 - \varrho_{1n\cdot23\ldots(n-1)}^2\bigr).

This shows that the representation of ξ₁ by means of a linear combination of ξ₂, …, ξ_{n−1} is improved by including also the further variable ξ_n when and only when ρ_{1n·23…(n−1)} ≠ 0.

13. Prove the relations (23.4.6) and (23.5.3).

14. Use the continuity theorem 10.7 to prove the following proposition: If a sequence of normal distributions in R_n converges to a distribution, the limiting distribution is normal. (Note that, in accordance with 24.4, the expression »normal distribution» includes singular as well as non-singular distributions.)

15. The variables ξ₁, …, ξ_n have a non-singular normal distribution, with the mean values m₁, …, m_n and the moment matrix Λ. Use (11.12.3) and the final remark of 24.2 to show that the variable

\chi^2 = \frac{1}{\Lambda}\sum_{j,k=1}^{n} \Lambda_{jk}\,(\xi_j - m_j)(\xi_k - m_k)

has a χ²-distribution with n degrees of freedom, the fr. f. being given by (18.1.3).

16. ξ₁, …
, ξ_n are independent and normally distributed variables, all having the same s. d. σ, while the mean values may be different. New variables η₁, …, η_n are introduced by an orthogonal transformation. Show by means of 24.4 that the η_i are independent and normally distributed, all having the same s. d. σ as the ξ_i.

THIRD PART

STATISTICAL INFERENCE

Chapters 25-26. Generalities.

CHAPTER 25.

Preliminary Notions on Sampling.

25.1. Introductory remarks. — In accordance with our general discussion of principles in Chs 13-14, the whole theory of random variables and probability distributions developed in Part II should be considered as a system of mathematical propositions designed to form a model of the statistical regularities observed in connection with sequences of random experiments. As already pointed out in 14.6, it will now be our task to work out methods for testing the mathematical theory by experience, and to show how the theory may be applied to problems of statistical inference. — These questions will form the subject matter of Part III.

Among the sets of statistical data occurring in practical applications, we may distinguish certain general classes which, in some ways, require different types of theoretical treatment. In the present chapter, we shall give a few brief indications concerning some of the most important of these classes. — The following chapter will be devoted to a preliminary survey of questions of principle connected with the testing and applications of the theory.

25.2. Simple random sampling. — Consider a random experiment 𝔈 connected with a one-dimensional random variable ξ. If we make n independent repetitions of 𝔈, we shall obtain a sequence of n observed values of the variable, say x₁, …, x_n. A sequence of this type, forming the result of n independent repetitions of a certain random experiment, is representative of a simple but fundamentally important class of statistical data.
With respect to data belonging to this class, we shall often use a current terminology derived from certain particular fields of application, as we are now going to explain.

Consider a random experiment 𝔈 of the following type: A certain set containing a finite number of elements is given, and our experiment consists in choosing at random an element from the set, observing the value of some characteristic ξ of the element, and then replacing the element in the set. It is assumed that the experiment is so arranged that the probability of being chosen is the same for all elements. — Using expressions borrowed from the statistical study of human and other biological populations, we shall talk of the given set as the parent population, and of its elements as members or individuals (cf 13.3). The group of individuals observed in the course of n repetitions of the experiment 𝔈 will be called a random sample from the population, and the sampling process thus described will be denoted as simple random sampling.

Often we are not interested in the individuals as such, but only in the values of the variable characteristic ξ and their distribution among the members. In such cases we shall find it advantageous to consider the parent populations as composed, not of individuals, but of values of ξ. A sequence of n observed values x₁, …, x_n will then be conceived as a random sample from this population of ξ-values. Talking from this point of view, we may replace the parent population by an urn containing one ticket for each member of the population, with the corresponding value of ξ inscribed on it. The experiment 𝔈 will then consist in drawing at random a ticket, noting the value inscribed, and replacing the ticket in the urn. As there are only a finite number of tickets in the urn, the random variable ξ will only have a finite number of possible values, so that its distribution will be of the discrete type (cf 15.2).
By taking the number N of tickets very large, this distribution may, however, be made to approximate as closely as we please to any distribution given in advance, and when N tends to infinity the error involved in the approximation may be made to tend to zero. As a matter of illustration, we may thus interpret any type of random experiment 𝔈 as the random selection of an individual from an infinite parent population (cf 13.3). We then imagine an urn containing an infinite number of tickets, on each of which a certain number is written, in such a way that the distribution of these numbers is identical with the distribution of the random variable ξ associated with 𝔈. Each performance of 𝔈 is now interpreted as the drawing of a ticket from this urn, and a sequence x₁, …, x_n of observed values of ξ is regarded as a random sample from the infinite population of numbers inscribed on the tickets. The values x₁, …, x_n will accordingly be called the sample values.
All the above may be directly extended to the case of a random variable in any number Jc of dimensions. Every individual in our imaginary infinite population will then be characterized by a set of k numbers, and any sequence of observed values of the A^-dimensional random variable may be interpreted as a random sample from such an infinite /-dimensional population. 25.3. The distribution of the sample. — Consider a sequence of n observed v^alues x^^ . , Xn of a one-dimensional random variable ^ with the d.f. i^’(.r). According to the preceding paragraph, we may regard Xj, . . . ., Xu as a set of sample values, »drawn^ from a popula- tion with the d. f. F[x). The sample may be geometrically represented by the set of n points Xj, . . Xn on the .r-axis. The distribution of the sample will then be defined as the distribu- tion obtained by placing a mass equal to \ln in each of the points . . ., x„. This is a distribution of the discrete type, having n di.screte mass points (some of which may, of course, coincide). The corresj)onding d.f., which will be denoted by is a step-function with a step of the height 1/^/ in each x,-. If we denote by v the number of sample values that are ^ x, we evidently have ( 25 . 3 .]) n so that /’’* (x) represents the frequency ratio of the event ^ ^ x in our sequence of n observations. 325 25.3 Obviously this distribution is uniquely determined by the sample. On the other hand, two samples consisting of the same values in difiFerent arrangements will give the same distribution. The distribu- tion determines, in fact, only the positions of the sample values on the a;-axis, but not their mutual order in the sample. For the distribution thus defined, with the d.f. we may calculate various characteristics such as moments, semi-invariants, coef- ficients of skewness and excess etc., according to the general rules for one-dimensional distributions given in Ch. 15. These characteristics will be called the moments etc. 
of the sample^ as distinct from the corresponding characteristics of the distHhution associated with the random variable 5 and the d. f. F[x). The latter characteristics will also be called the moments etc. of the population. Thus e. g. by 15.4 the r:th moment of the sample is 00 — 00 ^ i, e. the arithmetic mean of the v:th powers of the sample values, oo while the corresponding moment of the population is or,. ~ ^x'^dF[x). — 00 The above definitions directly extend themselves to samples from multi-dimensional populations. Suppose e. g. that we have a sample of n pairs of values yX • • (^n, .Vn) of a two-dimensional random vari- able. This sample may be geometrically represented by the set of n points (a?!, (xn, in a plane, and the distribution of the sample is the discrete distribution obtained by placing a mass equal to Ijn in each of these n points. For this distribution, we may calculate moments, coefficients of regression and correlation, and other char- acteristics according to the general rules for two-dimensional distribu- tions given in Ch. 21. These are the moments etc. of the sample as distinct from the corresponding characteristics of the distribution (or of the population). — The extension to samples from populations of more than two dimensions is obvious. The distribution of a sample, as well as the moments and other characteristics of such a distribution, will play an important part in the sequel. In this connection, we shall use a particular system of notations that will be explained in 27.1. 326 25.4-25.5 25.4. The sample values as random variables. Sampling distri- butions. — In order to obtain a sample of n values of a one-dimen- sional random variable with the d. f. F[^), we have to perform a sequence of n independent repetitions of the random experiment (£ to which the variable is attached. This sequence of n repetitions forms a combined experiment, bearing on n independent variables Xv, where xt is associated with the ^:th repetition of (£. 
The sample values x₁, …, x_n that express the result of such a combined experiment thus give rise to a combined random variable (x₁, …, x_n) in n dimensions, where the x_i are independent variables, all of which have the same d. f. F(x). The values of x₁, …, x_n observed in an actual sample form an observed »value» of the n-dimensional random variable (x₁, …, x_n).

When the sample values are thus conceived as random variables, any function of x₁, …, x_n is by 14.5 a random variable with a distribution uniquely determined by the joint distribution of the x_i, i. e. by the d. f. F(x). Now any moment or other characteristic of the sample is a certain function g(x₁, …, x_n) of the sample values. Consequently any sample characteristic gives rise to a random variable with a distribution uniquely determined by F(x). If samples of n values are repeatedly drawn from the same population, and if for each sample the characteristic g(x₁, …, x_n) is calculated, the sequence of values obtained in this way will constitute a sequence of observed values of the random variable g(x₁, …, x_n). The probability distribution of this variable will be called the sampling distribution of the corresponding characteristic.

These remarks are immediately extended to the case of samples from multi-dimensional populations. In the same sense as above, the sample values will here be conceived as random variables. Further, any moment, correlation coefficient or other characteristic of such a sample is a function of the sample values, and thus gives rise to a certain random variable, the distribution of which is uniquely determined by the distribution of the population. This is the sampling distribution of the characteristic. Thus we may talk of the sampling distribution of the mean of a sample, of the variance, the correlation coefficient etc. The properties of sampling distributions of various important sample characteristics will be studied in Chs 27-29.

25.5.
Statistical image of a distribution. — As an example of the concepts introduced in the preceding jmragraph, we consider the d. f. 327 25.5 Fig. 25. Slim polygon for 100 mean temperatures Celsius) in Stockholm, June 1841 — 1940, and normal distribution function. of a one-dimensional sample, which by (25.3.1) is a function of the sample values, containing: a variable parameter .r. As observed in 25.3, F*{x) is equal to the frequency of the event ^ S a? in a sequence of n repetitions of (£. Now, by the definition of the d. f. F{x) of the variable f, the event has the probability Thus it follows from the Bernoulli theorem, as interpreted in 20.3, that F*(x) con- verges in probability to F{x), as w -► oo. When u is large, it is thus practically certain that the d. f. F*[x) of the sample will be approximately equal to the d. f. F(x) of the population. Consequently we may regard the distribution of the sample as a kind of statisiical image of the distribution of the population. The graph ;/ == F’*' (x) of the step-function is known as the sum polygon of the sample. For large values of w, this will thus be ex- pected to give a good approximation to the curve y = F[x). As an example, we show in Fig. 25 the' sum polygon for a sample of 100 mean temperatures in Stockholm for the month of June (cf Table 30.4.2), together with the (hypothetical) normal d. f. of the corresponding population. In practice, samples from continuous distributions are often grouped. This means that we are not given the individual sample values, but only the number of sample values falling into certain specified class 328 25.5 Fig. 20. Histogram for the breadths of 12 000 beans, and freciiiency curve according to Edgeworth’s scries. The scale on the horizontal axis refers to a conventional numeration of the class intervals. iutervah. 
We then take every class interval as the basis of a rectangle with the height ν/(nh), where h is the length of the interval, while ν denotes the number of sample values in the class. The figure obtained in this way is the histogram of the sample. The area of any rectangle in the histogram is equal to the corresponding class frequency ν/n. For large n this may be expected to be approximately equal to the probability that an observed value of the variable will belong to the corresponding class interval, which is identical with the integral of the fr. f. f(x) over the interval. Thus the upper contour of the histogram will form a statistical image of the fr. f., in the same way as the sum polygon does so for the d. f. As an example, we show in Fig. 26 the histogram of the sample of 12 000 breadths of beans given in Table 30.4.3, together with the (hypothetical) fr. f. of the corresponding population, according to the Edgeworth expansion (17.7.5). Analogous remarks apply to the distribution of a sample in any number of dimensions. Later on, we shall find that the same kind of relationship also exists between the various characteristics of the distributions of the sample and of the population. It will, in fact, be shown in 27.3 and 27.8 that, under fairly general conditions, a characteristic of the sample converges in probability to the corresponding characteristic of the population, as the size of the sample tends to infinity. In such cases, the sample characteristics may be regarded as estimates of the corresponding population characteristics. The systematic investigation of such estimates and their probability distributions will, in the sequel, provide some of the most powerful tools of statistical inference. 25.6. Biased sampling. Random sampling numbers. — When we are concerned with a finite parent population, the idea of simple random sampling has a precise and concrete significance.
We may always imagine an experimental arrangement satisfying the conditions for a random selection of individuals from such a population, with equal chances for all the individuals, even though its practical realization may sometimes be exceedingly difficult. In practice there will often be a bias in favour of certain individuals or groups of individuals, and accordingly we then talk of a biased sampling. Experience shows e. g. that such a bias is always to be expected when the selection of individuals from a population is more or less dependent on human choice. It does not enter into the plan of this book to give an account of questions belonging to the technique of random sampling, such as the arrangements by which bias may be as far as possible eliminated. We shall only remark that in many cases it is possible to use with advantage some of the published tables of random sampling numbers. (Ref. 262, 263, 267.) Such a table consists of a sequence of digits intended to represent the result of a simple random sampling from a population consisting of the ten digits 0, 1, …, 9. By joining two columns of the table we may obtain a sequence of numbers formed in the same way from the population consisting of the 10² numbers 00, …, 99, and similarly for three, four or any larger number of columns. Suppose that we want to use such a table to draw a random sample of 100 individuals from a population consisting of, say, 8183 members. The members are first numbered from 0000 to 8182. We then read a sequence of four-figure numbers from the table, disregarding numbers above 8182, and go on until we have obtained 100 numbers. Our sample will then consist of the members corresponding to these numbers. If the sampling is to be made without replacement (cf 25.7), we must also during the course of reading the numbers from the table disregard any number that has already appeared.
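The procedure just described translates directly into code. A minimal sketch, in which the stream of random digits is simulated and stands in for a published table:

```python
import random

def draw_sample(digit_stream, population_size, sample_size):
    """Read four-figure numbers from a stream of digits, discarding numbers
    above population_size - 1 and, since the sampling is to be made without
    replacement, any number that has already appeared."""
    chosen, seen, buffer = [], set(), []
    for d in digit_stream:
        buffer.append(str(d))
        if len(buffer) == 4:
            number = int("".join(buffer))
            buffer.clear()
            if number < population_size and number not in seen:
                seen.add(number)
                chosen.append(number)
                if len(chosen) == sample_size:
                    return chosen
    raise ValueError("digit stream exhausted before the sample was complete")

rng = random.Random(7)
digits = (rng.randrange(10) for _ in range(100000))
# A sample of 100 members from a population numbered 0000 to 8182.
sample = draw_sample(digits, population_size=8183, sample_size=100)
```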
The tables may also be used to obtain a sample of observed values of a random variable with any given d. f. F(x). Suppose that we dispose of a table of values of F(x) that enables us, for every m-figure number r, to solve the equation F(aᵣ) = r · 10⁻ᵐ with respect to aᵣ. From our table of random numbers, we now read a sequence of m-figure numbers r, and determine the sample values x such that the x corresponding to any r falls in the interval aᵣ < x ≤ aᵣ₊₁. Thus we obtain in this way a grouped sample: the sample values are not exactly determined, but the process yields the number of sample values belonging to any interval (aᵣ, aᵣ₊₁), and it is seen that the probability for any sample value to fall in this interval has the correct value F(aᵣ₊₁) − F(aᵣ) = 10⁻ᵐ. The larger we take m, the finer is the grouping and the more accurate the determination of the sample values. — Further discussion of the tables of random sampling numbers and their use will be found in the introductions to the tables and in two papers by Kendall and Babington Smith (Ref. 137). 25.7. Sampling without replacement. The representative method. — In practice, a sample from a finite population is often taken in such a way that a drawn individual is not replaced in the population before the next drawing. A sequence of drawings of this type has obviously not the character of repetitions of a random experiment under uniform conditions, since the composition of the population changes from one drawing to another. We talk here of sampling without replacement, as distinct from simple random sampling, which is a sampling with replacement. When the population is very large, and the sample only contains a small fraction of the total population, it is obvious that the difference between these modes of sampling is unimportant, and in the limiting case when the population becomes infinite, while the size of the sample remains finite, the difference disappears.
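How quickly does the difference between the two modes of sampling disappear? One standard way to see it is to compare the variance of the number of »marked» individuals in the sample under the two schemes; without replacement the (hypergeometric) variance carries the finite-population factor (N − n)/(N − 1). These variance formulas are standard results assumed here, not derived in this passage:

```python
def variance_with_replacement(n, p):
    """Variance of the number of marked individuals in a sample of size n,
    sampling with replacement (binomial): n p (1 - p)."""
    return n * p * (1 - p)

def variance_without_replacement(n, p, N):
    """The same variance when sampling without replacement from a population
    of N individuals (hypergeometric): n p (1 - p) (N - n)/(N - 1)."""
    return n * p * (1 - p) * (N - n) / (N - 1)

n, p = 20, 0.3
# As the population grows while the sample size stays fixed, the difference
# between the modes of sampling becomes unimportant and disappears in the limit.
diffs = [variance_with_replacement(n, p) - variance_without_replacement(n, p, N)
         for N in (100, 1000, 100000)]
```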
Sampling without replacement plays an important part in applied statistics. When it is desired to obtain information as to the characteristics of some large population, such as the inhabitants of a country, the fir-trees of a district, a consignment of articles delivered by a factory etc., it is often practically impossible to observe or measure every individual in the whole population. The method generally used in such situations is known as the representative method: a sample of individuals is selected for observation, and it is endeavoured to make the sample as representative as possible of the total population. The observed characteristics of the sample are then used to form estimates of the unknown characteristics of the total population. Usually in such cases samples are taken without replacement. The method of selection may be random or purposive; in the latter case we deliberately choose the individuals entering into our sample in order to obtain a representative sample. Often also mixed methods are used. — For the theory of the representative method, we refer to Neyman, Ref. 161. Some simple cases will be considered in 34.2 and 34.4. CHAPTER 26. Statistical Inference. 26.1. Introductory remarks. — It has been strongly emphasized in 13.4 that no mathematical theory deals directly with the things of which we have immediate experience. The mathematical theory belongs entirely to the conceptual sphere, and deals with purely abstract objects. The theory is, however, designed to form a model of a certain group of phenomena in the physical world, and the abstract objects and propositions of the theory have their counterparts in certain observable things, and relations between things. If the model is to be practically useful, there must be some kind of general agreement between the theoretical propositions and their empirical counterparts.
When a certain proposition has its counterpart in some directly observable relation, we must require that our observations should, in fact, show that this relation holds. If, in repeated tests, an agreement of this character has been found, and if we regard this agreement as sufficiently accurate and permanent, the theory may be accepted for practical use. In the present chapter, we shall discuss some points that arise when these general principles are applied to the mathematical theory of probability. We shall first consider the testing of the agreement between theory and facts, and then proceed to give a brief survey of the applications of the theory for purposes of statistical inference. 26.2. Agreement between theory and facts. Tests of significance. — The concept of mathematical probability as defined in 13.5 has its empirical counterpart in certain directly observable frequency ratios. The proposition: »The probability of the event E in connection with the random experiment 𝔈 is equal to P» has, by 13.5, its counterpart in the statement denoted as the frequency interpretation of the probability P, which runs as follows: »In a long sequence of repetitions of 𝔈, it is practically certain that the frequency of E will be approximately equal to P». Accordingly we must require that, whenever a theoretical deduction leads to a definite numerical value for the probability of a certain observable event, the truth of the corresponding frequency interpretation should be borne out by our observations. Thus e. g. when the probability of an event is very small, we must require that in the long run the event should occur at most in a very small percentage of all repetitions of the corresponding experiment. Consequently we must be able to regard it as practically certain that, in one single performance of the experiment, the event will not occur (cf 13.5).
— Similarly, when the probability of an event differs from unity by a very small amount, we must require that it should be practically certain that, in one single performance of the corresponding experiment, the event will occur. In a great number of cases, the problem of testing the agreement between theory and facts presents itself in the following form. We have at our disposal a sample of n observed values of some variable, and we want to know if this variable can be reasonably regarded as a random variable having a probability distribution with certain given properties. In some cases, the hypothetical distribution will be completely specified: we may, e. g., ask if it is reasonable to suppose that our sample has been drawn by simple random sampling from a population having a normal distribution with m = 0 and σ = 1 (cf 17.2). In other cases, we are given a certain class of distributions, and we ask if our sample might have been drawn from a population having some distribution belonging to the given class. Consider the simple case when the hypothetical distribution is completely specified, say by means of its d. f. F(x). We then have to test the statistical hypothesis that our sample has been drawn from a population with this distribution. We begin by assuming that the hypothesis to be tested is true. It then follows from 25.5 that the d. f. F*(x) of the sample may be expected to form an approximation to the given d. f. F(x), when n is large. Let us define some non-negative measure of the deviation of F* from F. This may, of course, be made in various ways, but any deviation measure D will be some function of the sample values, and will thus according to 25.4 have a determined sampling distribution. By means of this sampling distribution, we may calculate the probability P(D > D₀) that the deviation D will exceed any given quantity D₀. This probability may be made as small as we please by taking D₀ sufficiently large.
Let us choose D₀ such that P(D > D₀) = ε, where ε is so small that we are prepared to regard it as practically certain that an event of probability ε will not occur in one single trial. Suppose now that we are given an actual sample of n values, and let us calculate the quantity D from these values. Then if we find a value D > D₀, this means that an event of probability ε has presented itself. However, on our hypothesis such an event ought to be practically impossible in one single trial, and thus we must come to the conclusion that in this case our hypothesis has been disproved by experience. On the other hand, if we find a value D ≤ D₀, we shall be willing to accept the hypothesis as a reasonable interpretation of our data, at least until further experience has been gained in the matter. This is our first instance of a type of argument which is of very frequent occurrence in statistical inference. We shall often encounter situations where we are concerned with some more or less complicated hypothesis regarding the properties of the probability distributions of certain variables, and it is required to test whether available statistical data agree with this hypothesis or not. A first approach to the problem is obtained by proceeding as in the simple case considered above. If the hypothesis is true, our sample values should form a statistical image (cf 25.5) of the hypothetical distribution, and we accordingly introduce some convenient measure D of the deviation of the sample from the distribution. By means of the sampling distribution of D, we then find a quantity D₀ such that P(D > D₀) = ε, where ε is determined as above. If, in an actual case, we find a value D > D₀, we then say that the deviation is significant, and we consider the hypothesis as disproved. On the other hand, when D ≤ D₀, the deviation is regarded as possibly due to random fluctuations, and the data are regarded as consistent with the hypothesis.
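The whole scheme can be sketched numerically. The choices below are assumptions for illustration, not taken from the text: the hypothetical distribution is uniform on (0, 1), the deviation measure D is the maximum deviation of F* from F (a Kolmogorov-type statistic), and D₀ is estimated as the (1 − ε)-quantile of a simulated sampling distribution of D.

```python
import random

def max_deviation(sample):
    """D = maximum deviation of the sample d. f. F*(x) from the hypothetical
    uniform d. f. F(x) = x on (0, 1), evaluated at the jump points of F*."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def significance_limit(n, epsilon, repetitions=2000, seed=0):
    """Estimate D0 with P(D > D0) = epsilon by simulating the sampling
    distribution of D under the hypothesis."""
    rng = random.Random(seed)
    ds = sorted(max_deviation([rng.random() for _ in range(n)])
                for _ in range(repetitions))
    return ds[int((1.0 - epsilon) * repetitions)]

D0 = significance_limit(n=50, epsilon=0.05)
# A sample from a population quite different from the hypothetical one
# should give a significant deviation D > D0.
rng = random.Random(42)
D = max_deviation([rng.random() ** 4 for _ in range(50)])
significant = D > D0
```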
A test of this general character will be called a test of significance relative to the hypothesis in question. In the simple case when the test is concerned with the agreement between the distribution of a set of sample values and a theoretical distribution, we talk more specifically of a test of goodness of fit. The probability ε, which may be arbitrarily fixed, is called the level of significance of the test. In a case when our deviation measure D exceeds the significance limit D₀, we thus regard the hypothesis as disproved by experience. This is, of course, by no means equivalent to a logical disproof. Even if the hypothesis is true, the event D > D₀ with the probability ε may occur in an exceptional case. However, when ε is sufficiently small, we feel practically justified in disregarding this possibility. On the other hand, the occurrence of a single value D ≤ D₀ does not provide a proof of the truth of the hypothesis. It only shows that, from the point of view of the particular test applied, the agreement between theory and observations is satisfactory. Before a statistical hypothesis can be regarded as practically established, it will have to pass repeated tests of different kinds. In Chs 30 — 31, we shall discuss various simple tests of significance, and give numerical examples of their application. In Ch. 35, the general foundations of tests of this character will be submitted to a critical analysis. 26.3. Description. — In 13.4, the applications of a mathematical theory were roughly classified under the headings: Description, Analysis and Prediction. There are, of course, no sharp distinctions between the three classes, and the whole classification is only introduced as a matter of convenience. We shall now briefly comment upon some important groups of applications belonging to the three classes. In the first place, the theory may be used for purely descriptive purposes.
When a large set of statistical data has been collected, we are often interested in some particular properties of the phenomenon under investigation. It is then desirable to be able to condense the information with respect to these properties, which may be contained in the mass of original data, in a small number of descriptive characteristics. The ordinary characteristics of the distribution of the sample values, such as moments, semi-invariants, coefficients of regression and correlation etc., may generally be used with advantage for such purposes. The use of frequency-curves for the graduation of data, which plays an important part in the early literature of the subject, also belongs primarily to this group of applications. When we replace the mass of original data by a small number of descriptive characteristics, we perform a reduction of the data, according to the terminology of R. A. Fisher (Ref. 13, 89). It is obviously important that this reduction should be so arranged that as much as possible of the relevant information contained in the original data is extracted by the set of descriptive characteristics chosen. Now the essential properties of any sample characteristic are expressed by its sampling distribution, and thus the systematic investigation of such distributions in Chs 27 — 29 will be a necessary preliminary to the working out of useful methods of reduction. In most cases, however, the final object of a statistical investigation will not be of a purely descriptive nature. The descriptive characteristics will, in fact, usually be required for some definite purpose. We may, e. g., want to compare various sets of data with the aid of the characteristics of each set, or we may want to form estimates of the values of the characteristics that we expect to find in future sets of data.
In such cases, the description of the actual data forms only a preliminary stage of the inquiry, and we are in reality concerned with an application belonging to one of the two following classes. 26.4. Analysis. — When a mathematical theory has been tested and approved, it may be used to provide tools for a scientific analysis of observational data. In the present case we may characterize this type of applications by saying that we are trying to argue from the sample to the population. We are given certain sets of statistical data, which are conceived to be samples from certain populations, and we try to use the data to learn something about the distributions of the populations. A great variety of problems of this class occur in statistical practice. In this preliminary survey, we shall only mention some of the main types which, in later chapters, will be more thoroughly discussed. In 26.2, we have already met with the following type of problems: We are given a sample of observed values of a variable, and we ask if it is reasonable to assume that the sample may have been drawn from a distribution belonging to some given class. Are we, e. g., justified in saying that the errors in a certain kind of physical measurements are normally distributed? Or that the distribution of incomes among the citizens of a certain state follows the law of Pareto (cf 19.3)? — In neither case will the distribution of an actual sample coincide exactly with the hypothetical distribution, since the former is of the discrete, and the latter of the continuous type. But are we entitled to ascribe the deviation of the observed distribution from the hypothetical one to random fluctuations, or should we conclude that the deviation is significant, i. e. indicative of a real difference between the unknown distribution of the population and the hypothetical distribution? We have seen in 26.2 how this question may be attacked by means of the introduction of a test of significance.
We then have to calculate a certain measure of deviation D, and in an actual case the deviation is regarded as significant if D exceeds a certain given value D₀, while otherwise the deviation will be ascribed to random fluctuations. In other cases, we assume that the general character of the distributions is known from earlier experience, and we require information as to the values of some particular characteristics of the distributions. Suppose, e. g., that we want to compare the effects of two different methods of treatment of the same disease, and let us assume that for each method there is a constant probability of recovery. Are the two probabilities different? In order to throw light upon the problem, we collect one sample of cases for each method, and compare the two frequencies of recovery. In general these will be different, and we are facing the same question as in the previous case: Is the difference due to random fluctuations, or is it significant, i. e. indicative of a real difference between the probabilities? Similar, though often more complicated problems arise in many cases, e. g. in agricultural, industrial or medical statistics, when we want to compare the effects of various methods of treatment or of production. We are then concerned with the means or some other characteristics of our samples, and we ask whether the differences between the observed values of these characteristics should be ascribed to random fluctuations or judged to be significant. In such cases, it is often useful to begin by considering the hypothesis that there is no difference between the effects of the methods, so that in reality all our samples come from the same population. (This is sometimes called the null hypothesis.) This being assumed, it will often be possible to work out a test of significance for the differences between the means or other characteristics in which we are interested.
If the differences exceed certain limits, they will be regarded as significant, and we shall conclude that there is a real difference between the methods; otherwise we shall ascribe the differences to random fluctuations. This type of applications belongs to the realm of the statistical analysis of causes. Suppose, more generally, that we want to know whether there exists any appreciable causal relationship between two variables x and y that we are investigating. As a first approach to the problem, we may then set up the null hypothesis, which in this case implies that the variables are independent, and proceed to work out a test of significance for this hypothesis on the general lines indicated above. Suppose, e. g., that we are interested in tracing a possible connection between the annual quantities x and y of two commodities consumed in a given group of households. From a sample of observed values of the two-dimensional variable (x, y), we may then calculate e. g. the sample correlation coefficient r. In general this coefficient will be different from zero, whereas on the null hypothesis the correlation coefficient ρ of the corresponding distribution is equal to zero. Is the difference significant, or should it be ascribed to random fluctuations? In order to answer this question, we shall have to work out a test of significance, based on the properties of the sampling distribution of r. If r differs significantly from zero, this may be taken as an indication of some kind of dependence between the variables. The converse conclusion is, however, not legitimate. Even if the population value ρ is equal to zero, the variables may be dependent (cf 21.7). Various tests of significance adapted to problems of the general character indicated above will be treated in Chs 30 — 31. The test of significance to be applied to a given problem may always be chosen in many different ways.
It thus becomes an important problem to examine the principles underlying the choice of a test, to compare the properties of various alternative tests and, if possible, to show how to find the test that will be most efficient for a given purpose. Questions belonging to this order of ideas will be considered in Ch. 35. In a further type of problems of statistical analysis it is required to use a set of sample values to form estimates of various characteristics of the population from which the sample is supposed to be drawn, and to form an idea of the precision of such estimates. The simplest problem of this type is the classical problem of inverse probability: given the frequency of an event E in a sequence of repetitions of a random experiment, what kind of conclusions can be drawn with respect to the unknown value of the probability p of E? It is fairly obvious that in this case the observed frequency ratio may be taken as an estimate of p, but will it be possible to measure the precision of this estimate, and even to make some valid probability statement concerning the difference between the estimate and the unknown »true value» of p? — A more complicated problem of the same character arises in the theory of errors, where we have at our disposal a set of measurements on quantities connected with a certain number of unknown constants, and it is required to form estimates of the values of these constants, and to appreciate the precision of the estimates. Similar problems occur in connection with the method of multiple regression, which is of great importance in many fields of application. In certain economic problems, e. g., economic theory leads us to assume that there exist certain linear or approximately linear relations between variables connected with consumers' incomes, prices and quantities of various commodities produced or consumed in a given market.
When a set of observed values of these variables is available, it is then required to form estimates of the »elasticities» or similar quantities that appear as coefficients in the relations between the variables. A general form of the estimation problem may be stated in the following way. We consider a random variable (in any number of dimensions), the distribution of which has a known mathematical form, but contains a certain number of unknown constant parameters. We are given a sample of observed values of the variable, and it is required to use the sample values to form estimates of the parameters, and to appreciate the precision of the estimates. In general, there will be an infinite number of different functions of the sample values that may be used as estimates, and it will then be important to compare the properties of various possible estimates for the same parameter, and in particular to find the functions (if any) that yield estimates of maximum precision. Further, when a system of estimates has been computed, it will be natural to ask if it is possible to make some valid probability statements concerning the deviations of the estimates from the unknown »true values» of the parameters. Problems of this type form the object of the theory of estimation, which will be treated in Chs 32 — 34. — Finally, some applications of the preceding theories will be given in Chs 36 — 37. 26.5. Prediction. — The word prediction should here be understood in a very wide sense, as related to the ability to answer questions such as: What is going to happen under given conditions? — What consequences are we likely to encounter if we take this or that possible course of action? — What course of action should we take in order to produce some given event? — Prediction, in this wide sense of the word, is the practical aim of any form of science. Questions of the type indicated often arise in connection with random variables.
We shall quote some examples: What numbers of marriages, births and deaths are we likely to find in a given country during the next year? — What distribution of colours should we expect in the offspring of a pair of mice of known genetical constitution? — What effects are likely to occur, if the price of a certain commodity is raised or lowered by a given amount? — Given the results of certain routine tests on a sample from a batch of manufactured articles, should the batch be a) destroyed, or b) placed on the market under a guarantee? — How should the premiums and funds of an insurance office be calculated in order to produce a stable business? — What margin of security should be applied in the planning of a new telephone exchange in order to reduce the risk of a temporary overloading within reasonable limits? If we suppose that we know the probability distributions of the variables that enter into a question of this type, it will be seen that we shall often be in a position to give at least a tentative answer to the question. A full discussion of a question of this type, however, usually requires an intimate knowledge of the particular field of application concerned. In a work on general statistical theory, such as the present one, it is obviously not possible to enter upon such discussions. Chapters 27 — 29. Sampling Distributions. CHAPTER 27. Characteristics of Sampling Distributions. 27.1. Notations. — Consider a one-dimensional random variable ξ with the d. f. F(x). For the moments and other characteristics of the distribution of ξ we shall use the notations introduced in Ch. 15. Thus m and σ² denote the mean and the variance of the variable, while αᵣ, μᵣ and κᵣ denote respectively the moment, central moment and semi-invariant of order r. We shall suppose throughout, and without further notice, that these quantities are finite, as far as they are required for the deduction of our formulae.
By n repetitions of the random experiment to which the variable ξ is attached, we obtain a sequence of n observed values of the variable: x₁, x₂, …, xₙ. As explained in 25.2, we shall in this connection use a terminology derived from the process of simple random sampling, thus regarding the set of values x₁, …, xₙ as a sample from a population specified by the d. f. F(x). The distribution of the sample is obtained (cf 25.3) by placing a mass equal to 1/n in each point xᵢ, and the moments and other characteristics of the sample are defined as the characteristics of this distribution. In all investigations dealing with sample characteristics, it is most important to use a clear and consistent system of notations. In this respect, we shall as far as possible apply the following three rules throughout the rest of the book: 1. The arithmetic mean of any number of quantities such as x₁, …, xₙ or y₁, …, yₖ will be denoted by the corresponding letter with a bar: x̄ or ȳ. 2. When a certain characteristic of the population (i. e. of the distribution of the variable ξ) is ordinarily denoted by a Greek letter, the corresponding characteristic of the sample will be denoted by the corresponding italic letter: s² for σ², mᵣ for μᵣ etc. 3. In cases not covered by the two preceding rules we shall usually denote sample characteristics by placing an asterisk on the letter denoting the corresponding population characteristic, thus writing e. g. F*(x) for the d. f. of the sample, which corresponds to the population d. f. F(x). Thus the mean and the variance of the sample are (cf 25.3)

(27.1.1)  x̄ = (1/n) Σ xᵢ,   s² = (1/n) Σ (xᵢ − x̄)²,

where the summation is extended over all sample values: i = 1, 2, …, n. The moments aᵣ and the central moments mᵣ of the sample are

(27.1.2)  aᵣ = (1/n) Σ xᵢʳ,   mᵣ = (1/n) Σ (xᵢ − x̄)ʳ.

The coefficients of skewness and excess of the sample are, in accordance with (15.8.1) and (15.8.2),

(27.1.3)  g₁ = m₃/s³,   g₂ = m₄/s⁴ − 3.
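The definitions (27.1.1) to (27.1.3) compute directly. A minimal sketch; the sample below is an arbitrary illustration, chosen symmetric so that the skewness g₁ vanishes exactly:

```python
def sample_characteristics(xs):
    """Mean, variance, and the coefficients of skewness and excess,
    following (27.1.1)-(27.1.3)."""
    n = len(xs)
    xbar = sum(xs) / n                      # (27.1.1)
    def m(r):                               # central moments m_r, (27.1.2)
        return sum((x - xbar) ** r for x in xs) / n
    s2 = m(2)
    s = s2 ** 0.5
    g1 = m(3) / s ** 3                      # skewness, (27.1.3)
    g2 = m(4) / s ** 4 - 3                  # excess, (27.1.3)
    return xbar, s2, g1, g2

xbar, s2, g1, g2 = sample_characteristics([-2.0, -1.0, 0.0, 1.0, 2.0])
```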
The relations (15.4.4) between the moments and the central moments hold true for any distribution; thus in particular they remain valid if αᵣ and μᵣ are replaced by the corresponding sample characteristics aᵣ and mᵣ. For the d. f. of the sample, we have already in (25.3.1) introduced the notation F*(x). Similarly the c. f. of the sample is¹)

(27.1.4)  φ*(t) = ∫ e^(itx) dF*(x) = (1/n) Σ e^(itxᵢ),

and the semi-invariants of the sample are thus according to (15.10.2) defined by the development²)

(27.1.5)  log φ*(t) = Σ (kᵣ/r!)(it)ʳ.

All moments and semi-invariants of the sample are finite, and the relations (15.10.3) — (15.10.5) between moments and semi-invariants hold true when the population characteristics are replaced by sample characteristics. ¹) When there is a possibility of confusion, we shall use a heavy-faced i to denote the imaginary unit. ²) At this point our notation differs from the notation of R. A. Fisher (Ref. 13), who uses the symbol kᵣ to denote the unbiased estimate of κᵣ, which, in our notation, is denoted by another symbol (cf 27.6). The same rules will be applied to samples from multi-dimensional populations. Thus e. g. if we are given n pairs of observed values (x₁, y₁), …, (xₙ, yₙ) from a two-dimensional distribution, we write (cf 21.2)

(27.1.6)  x̄ = (1/n) Σ xᵢ,   ȳ = (1/n) Σ yᵢ,
          m₂₀ = s₁² = (1/n) Σ (xᵢ − x̄)²,   m₀₂ = s₂² = (1/n) Σ (yᵢ − ȳ)²,
          m₁₁ = (1/n) Σ (xᵢ − x̄)(yᵢ − ȳ).

In particular, the quantity r defined by the relation

(27.1.7)  r = m₁₁/(s₁ s₂)

is the correlation coefficient of the sample, which corresponds to the correlation coefficient ρ of the population. Since r is the correlation coefficient of an actual distribution (viz. the distribution of the sample), it follows from 21.2 that we have −1 ≤ r ≤ 1. The extreme values r = ± 1 can only occur when all the sample points (xᵢ, yᵢ) are situated on a single straight line. For a sample in more than two dimensions, we use notations derived according to the above rules from the notations introduced in Chs 22 — 23. Thus e. g. we denote by sᵢ the s. d.
of the sample values of the i:th variable, while $r_{ij}$ is the correlation coefficient between the sample values of the i:th and the j:th variables. We further write R for the determinant $|r_{ij}|$, and denote the regression coefficients, the partial correlation coefficients etc. of the sample by symbols such as (cf 23.2.3 and 23.4.2)

$b_{12\cdot 34\ldots k} = -\frac{s_1}{s_2}\,\frac{R_{12}}{R_{11}}, \qquad r_{12\cdot 34\ldots k} = \frac{-R_{12}}{\sqrt{R_{11} R_{22}}},$

where k is the number of dimensions, while the $R_{ij}$ are the cofactors of R. As before, all relations between the characteristics deduced in Part II hold true when the population characteristics are replaced by sample characteristics.

We now come back for one moment to the one-dimensional case. According to 25.4, any characteristic $g(x_1, \ldots, x_n)$ of an actual sample may be regarded as an observed value of a random variable $g(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ are independent variables, all having the same distribution as the original variable ξ. The distribution of the random variable $g(x_1, \ldots, x_n)$ is called the sampling distribution of the characteristic $g(x_1, \ldots, x_n)$. Thus we may talk of the sampling distribution of the mean $\bar x$, of the variance $s^2$, etc.

The same remarks apply to samples in any number of dimensions. Any sample characteristic may be regarded as an observed value of a certain random variable, the distribution of which is called the sampling distribution of the characteristic. Thus we may talk of the sampling distribution of the correlation coefficient r, of the correlation determinant R, etc.

For any sample characteristic g, we may thus consider its sampling distribution, and calculate the moments, semi-invariants etc. of this distribution. As usual (cf 15.3 and 15.6) we employ in such cases the symbols E(g) and D(g) to denote the mean and the s.d. of the random variable $g = g(x_1, \ldots, x_n)$. Further, when we are concerned with some characteristic of the distribution of g (such as a central moment, a semi-invariant etc.), which has been given a standard notation (such as $\mu_\nu$ or $\varkappa_\nu$) in Ch.
15, we shall sometimes use the standard symbol of this characteristic, followed by the corresponding random variable within brackets. Thus we shall write e.g. for the central moment of order ν of the sample characteristic $g = g(x_1, \ldots, x_n)$

$\mu_\nu(g) = E\big(g - E(g)\big)^\nu.$

Similarly, when two sample characteristics $f(x_1, \ldots, x_n)$ and $g(x_1, \ldots, x_n)$ are considered simultaneously, the correlation coefficient of their joint sampling distribution will be denoted by

$\varrho(f, g) = \frac{\mu_{11}(f, g)}{\sqrt{\mu_2(f)\,\mu_2(g)}}.$

Whenever we are concerned with sampling distributions connected with a given population, it should always be borne in mind that the sample characteristics ($\bar x$, s, $m_\nu$, r etc.) are conceived as random variables, while the population characteristics (m, σ, $\mu_\nu$, $\varkappa_\nu$, ϱ etc.) are fixed (though sometimes unknown) constants.

27.2. The sample mean $\bar x$. — Consider a one-dimensional sample with the values $x_1, \ldots, x_n$. Regarding the $x_i$ as independent random variables, each having the d.f. $F(x)$, we obtain

(27.2.1) $E(\bar x) = \frac1n\sum_i E(x_i) = m, \qquad D^2(\bar x) = \frac{1}{n^2}\sum_i D^2(x_i) = \frac{\sigma^2}{n}.$

Thus the random variable $\bar x$ has the mean m and the variance $\sigma^2/n$, i.e. the s.d. $\sigma/\sqrt n$. It then immediately follows from Tchebycheff's theorem 20.4 that the sample mean $\bar x$ converges in probability to the population mean m, as n tends to infinity.¹)

¹) By the less elementary Khintchine's theorem 20.5, it follows that this property holds as soon as the population mean m exists, even when $\sigma^2$ is not finite.

Writing $\bar x - m = \frac1n\sum_i (x_i - m)$, and bearing in mind that the $x_i$ are independent, and that any difference $x_i - m$ has the mean value zero, we further obtain

$\mu_3(\bar x) = E(\bar x - m)^3 = \frac{1}{n^3}\sum_i E(x_i - m)^3 = \frac{\mu_3}{n^2},$

$\mu_4(\bar x) = E(\bar x - m)^4 = \frac{1}{n^4}\Big(\sum_i E(x_i - m)^4 + 3\sum_{i \ne j} E(x_i - m)^2\,E(x_j - m)^2\Big) = \frac{3\mu_2^2}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}.$

The higher central moments of $\bar x$ may be found by similar, though somewhat more tedious, calculations. Thus we find

(27.2.2) $\mu_5(\bar x) = \frac{10\,\mu_2\mu_3}{n^3} + \frac{\mu_5 - 10\,\mu_2\mu_3}{n^4}, \qquad \mu_6(\bar x) = \frac{15\,\mu_2^3}{n^3} + O(n^{-4}),$

and generally

(27.2.3) $E(\bar x - m)^{2\nu - 1} = O(n^{-\nu}), \qquad E(\bar x - m)^{2\nu} = O(n^{-\nu}).$

In the important particular case when the distribution of the population is normal (m, σ), it has been pointed out in 17.3 that $\bar x$ is also normal, with mean m and s.d. $\sigma/\sqrt n$. It follows that in this case any $\mu_\nu(\bar x)$ of odd order is zero, while the first three central moments of even order reduce to

$\mu_2(\bar x) = \frac{\sigma^2}{n}, \qquad \mu_4(\bar x) = \frac{3\sigma^4}{n^2}, \qquad \mu_6(\bar x) = \frac{15\sigma^6}{n^3}.$

27.3. The moments $a_\nu$. — For any sample moment $a_\nu = \frac1n\sum_i x_i^\nu$ we obtain, in direct generalization of (27.2.1) and (27.2.3),

(27.3.1) $E(a_\nu) = \alpha_\nu, \qquad D^2(a_\nu) = \frac{1}{n^2}\sum_i\big(E(x_i^{2\nu}) - E^2(x_i^\nu)\big) = \frac{\alpha_{2\nu} - \alpha_\nu^2}{n}, \qquad E(a_\nu - \alpha_\nu)^{2k} = O(n^{-k}).$

By Khintchine's theorem 20.5 it follows from the first of these relations that, as soon as the population moment $\alpha_\nu$ exists, the sample moment $a_\nu$ converges in probability to $\alpha_\nu$ as n → ∞.

It now follows from the corollary to theorem 20.6 that any rational function, or power of a rational function, of the sample moments $a_\nu$ converges in probability to the constant obtained by substituting throughout $\alpha_\nu$ for $a_\nu$, provided that all the $\alpha_\nu$ occurring in the resulting expression exist, and that the constant thus obtained is finite.

Hence in particular the central moments $m_\nu$, the semi-invariants $k_\nu$, and the coefficients $g_1$ and $g_2$ defined by (27.1.3) all converge in probability to the corresponding population characteristics, as n → ∞. In large samples, any of these sample characteristics may thus be regarded as an estimate of the corresponding population characteristic. We shall, however, later find that the estimates obtained in this way are not always the best that we can obtain (cf 27.6 and 33.1).

Any mean value of the type

(27.3.2) $E(a_\nu^p\,a_\varrho^q \cdots) = E\Big(\frac1n\sum_i x_i^\nu\Big)^p\Big(\frac1n\sum_i x_i^\varrho\Big)^q \cdots,$

where p, q, … are integers, can be obtained by straightforward, though often tedious, algebraical calculation. We have only to use the fact that the $x_i$ are independent variables such that $E(x_i^\nu) = \alpha_\nu$.
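The relations (27.2.1), and the consistency just described, lend themselves to a quick Monte Carlo check. The sketch below (the choice of population, sample size and repetition count is mine) draws repeated samples from an exponential population with m = 1 and σ² = 1 and compares the empirical mean and variance of $\bar x$ with m and σ²/n:

```python
# Monte Carlo check (illustrative) of E(xbar) = m and D^2(xbar) = sigma^2/n
# from (27.2.1), for an exponential population with m = 1, sigma^2 = 1.
import random

random.seed(1)
n, reps = 50, 20000
means = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    means.append(sum(sample) / n)

mean_of_means = sum(means) / reps
var_of_means = sum((x - mean_of_means) ** 2 for x in means) / reps
# Theory: mean_of_means should be near m = 1, and
# var_of_means near sigma^2 / n = 1/50 = 0.02.
```

The agreement improves, in accordance with Tchebycheff's theorem, as either n or the number of repetitions grows.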
— In the particular case when the population mean m is equal to zero, $\alpha_\nu$ coincides with the central moment $\mu_\nu$. If the sample mean $\bar x$ occurs among the factors in (27.3.2), the calculations are in this case simplified, since any term containing one of the $x_i$ in the first degree has then the mean value zero.

27.4. The variance $m_2$. — Any central sample moment $m_\nu = \frac1n\sum_i (x_i - \bar x)^\nu$ is independent of the position of the origin on the scale of the variable. Placing the origin in the mean of the population, we have m = 0. When we are concerned with the sampling distributions of the $m_\nu$, we may thus always suppose m = 0, and so introduce the simplification mentioned at the end of the preceding paragraph. The formulae thus obtained will hold true irrespective of the value of m.

We accordingly suppose m = 0, and consider the sample variance

$m_2 = \frac1n\sum_i (x_i - \bar x)^2 = a_2 - \bar x^2.$

By (27.2.1) and (27.3.1) we have, since m = 0,

(27.4.1) $E(m_2) = E(a_2) - E(\bar x^2) = \mu_2 - \frac{\mu_2}{n} = \frac{n-1}{n}\,\mu_2.$

We further have $m_2^2 = a_2^2 - 2\bar x^2 a_2 + \bar x^4$. Assuming always m = 0, we find

$E(a_2^2) = \frac{\mu_4 + (n-1)\mu_2^2}{n}, \qquad E(\bar x^2 a_2) = \frac{\mu_4 + (n-1)\mu_2^2}{n^2}, \qquad E(\bar x^4) = \frac{\mu_4 + 3(n-1)\mu_2^2}{n^3},$

and hence after reduction

(27.4.2) $D^2(m_2) = E(m_2^2) - E^2(m_2) = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}.$

The higher central moments of $m_2$ may be obtained in the same way. The calculations are long and uninteresting, but no difficulty of principle is involved. We give only the leading terms of the third and fourth moments:

(27.4.3) $\mu_3(m_2) = \frac{\mu_6 - 3\mu_2\mu_4 - 6\mu_3^2 + 2\mu_2^3}{n^2} + O(n^{-3}), \qquad \mu_4(m_2) = \frac{3(\mu_4 - \mu_2^2)^2}{n^2} + O(n^{-3}).$

We shall finally consider the covariance (cf 21.2) between the mean $\bar x$ and the variance $m_2$ of the sample. For an arbitrary value of m, this is

$\mu_{11}(\bar x, m_2) = E\big((\bar x - m)(m_2 - E(m_2))\big) = E\big((\bar x - m)\,m_2\big).$

Since the last expression is clearly independent of the position of the origin, we may again assume m = 0, and thus obtain by calculations of the same kind as above

(27.4.4) $\mu_{11}(\bar x, m_2) = E(\bar x\,m_2) = E(\bar x\,a_2) - E(\bar x^3) = \frac{\mu_3}{n} - \frac{\mu_3}{n^2} = \frac{n-1}{n^2}\,\mu_3.$

For any symmetric distribution, we have $\mu_3 = 0$, and thus $\bar x$ and $m_2$ are uncorrelated.
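Because (27.4.1) and (27.4.4) are exact for every n, they can be verified without simulation by enumerating all possible samples from a small discrete population. The sketch below (the three-point population is an arbitrary example of mine) checks both identities for n = 3:

```python
# Exact check of E(m2) = (n-1)/n * mu2, cf (27.4.1), and of
# cov(xbar, m2) = (n-1)*mu3/n^2, cf (27.4.4), by enumerating all 3^3
# equally structured samples of size n = 3 from a small discrete
# population (the population itself is an arbitrary example).
from itertools import product

vals  = [0.0, 1.0, 4.0]
probs = [0.5, 0.3, 0.2]
n = 3

m   = sum(p * v for v, p in zip(vals, probs))
mu2 = sum(p * (v - m) ** 2 for v, p in zip(vals, probs))
mu3 = sum(p * (v - m) ** 3 for v, p in zip(vals, probs))

E_m2, E_xbar_m2 = 0.0, 0.0
for idx in product(range(3), repeat=n):
    w = 1.0
    for i in idx:
        w *= probs[i]               # probability of this ordered sample
    xs = [vals[i] for i in idx]
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    E_m2 += w * m2
    # (xbar - m) has mean zero, so E[(xbar - m)*m2] is the covariance.
    E_xbar_m2 += w * (xbar - m) * m2
```

Both expectations come out exactly (to rounding error) as the formulae predict, for any choice of the population.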
We shall see later (cf 29.3) that, in the particular case of a normal population, $\bar x$ and $m_2$ are not only uncorrelated, but even independent. For a normal population, (27.4.1) and (27.4.2) give

(27.4.5) $E(m_2) = \frac{n-1}{n}\,\sigma^2, \qquad D^2(m_2) = \frac{2(n-1)}{n^2}\,\sigma^4.$

27.5. Higher central moments and semi-invariants. — The expressions for the characteristics of the sampling distributions of $m_\nu$ and $k_\nu$ are of rapidly increasing complexity when ν becomes greater than 2, and we shall only mention a few comparatively simple cases, omitting details of calculation. For further information, the reader may be referred e.g. to papers by Tschuprow (Ref. 227) and Craig (Ref. 67).

By calculations of the same kind as in the preceding paragraphs, we obtain the expressions

(27.5.1) $E(m_3) = \frac{(n-1)(n-2)}{n^2}\,\mu_3, \qquad E(m_4) = \frac{(n-1)(n^2 - 3n + 3)}{n^3}\,\mu_4 + \frac{3(n-1)(2n-3)}{n^3}\,\mu_2^2.$

For any $m_\nu$ we have

(27.5.2) $m_\nu = a_\nu - \nu\,\bar x\,a_{\nu-1} + \binom{\nu}{2}\,\bar x^2\,a_{\nu-2} - \cdots$

As before, we may suppose m = 0, so that $E(a_\nu) = \mu_\nu$, and

$E(\bar x\,a_{\nu-1}) = \frac{1}{n^2}\sum_{i,j} E(x_i\,x_j^{\nu-1}) = \frac{\mu_\nu}{n}.$

For $1 < j \le \nu$ we have by (27.2.3) and (27.3.1), using the Schwarz inequality (9.5.1),

$E^2(\bar x^j a_{\nu-j}) \le E(\bar x^{2j})\,E(a_{\nu-j}^2) = O(n^{-j}),$

so that $E(\bar x^j a_{\nu-j}) = O(n^{-1})$, and (27.5.2) gives

(27.5.3) $E(m_\nu) = \mu_\nu + O\!\left(\frac1n\right).$

Further, by (27.5.2) any power of $m_\nu - \mu_\nu$ is composed of terms of the form $(a_\nu - \mu_\nu)^g\,\bar x^h\,a_{\nu-1}^{h_1}\cdots$, and it is shown in the same way as above that the mean value of such a term is of the order $n^{-(g+h)/2}$. Thus in order to calculate the leading term of $D^2(m_\nu)$ it is sufficient to retain the terms

$m_\nu - \mu_\nu \approx a_\nu - \mu_\nu - \nu\,\mu_{\nu-1}\,\bar x,$

while all the following terms of (27.5.2) give a contribution of lower order.
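The retained terms already determine the large-sample variance of $m_\nu$. As an illustrative simulation sketch (the population, sample size and loose tolerances are my own choices), take ν = 3 and an exponential population, for which $\mu_2 = 1$, $\mu_3 = 2$, $\mu_4 = 9$, $\mu_6 = 265$; the variance of $m_3$ should then be close to $(\mu_6 - \mu_3^2 - 6\mu_2\mu_4 + 9\mu_2\mu_2^2)/n = 216/n$:

```python
# Simulation sketch: the variance of the sample third central moment m3
# for an exponential(1) population, compared with its leading term 216/n.
import random

random.seed(11)
n, reps = 400, 6000
m3s = []
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    m3s.append(sum((x - xbar) ** 3 for x in xs) / n)

mean_m3 = sum(m3s) / reps
var_m3 = sum((m - mean_m3) ** 2 for m in m3s) / reps
# Exponential(1): mu2 = 1, mu3 = 2, mu4 = 9, mu6 = 265, so the leading
# term of the variance of m3 is (265 - 4 - 54 + 9)/n = 216/n.
```

The simulated variance differs from 216/n only by the neglected lower-order contributions and the Monte Carlo noise.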
For the variance we obtain in this way, since by (27.5.3) the difference $E(m_\nu) - \mu_\nu$ is of order $n^{-1}$,

(27.5.4) $D^2(m_\nu) = \frac{\mu_{2\nu} - \mu_\nu^2 - 2\nu\,\mu_{\nu-1}\mu_{\nu+1} + \nu^2\,\mu_2\,\mu_{\nu-1}^2}{n} + O(n^{-2}).$

Generally we obtain for any even power of $m_\nu - \mu_\nu$

(27.5.5) $E(m_\nu - \mu_\nu)^{2k} = O(n^{-k}).$

The mean value of a product $(m_\nu - \mu_\nu)(m_\varrho - \mu_\varrho)$ may be calculated in the same way, and we thus obtain, using again (27.5.3), the following expression for the covariance between $m_\nu$ and $m_\varrho$:

(27.5.6) $\mu_{11}(m_\nu, m_\varrho) = \frac{\mu_{\nu+\varrho} - \mu_\nu\mu_\varrho + \nu\varrho\,\mu_2\,\mu_{\nu-1}\mu_{\varrho-1} - \nu\,\mu_{\nu-1}\mu_{\varrho+1} - \varrho\,\mu_{\varrho-1}\mu_{\nu+1}}{n} + O(n^{-2}).$

The expressions of the first semi-invariants $k_\nu$ of the sample are obtained by substituting in (15.10.5) the sample moments $m_\nu$ for the population moments $\mu_\nu$. We obtain

$k_2 = m_2, \qquad k_3 = m_3, \qquad k_4 = m_4 - 3m_2^2.$

We may then deduce expressions for the means and variances of the $k_\nu$ by means of the formulae for the $m_\nu$ given above. In particular we obtain in this way, expressing $E(k_\nu)$ in terms of the population semi-invariants $\varkappa_\nu$,

(27.5.7) $E(k_2) = \frac{n-1}{n}\,\varkappa_2, \qquad E(k_3) = \frac{(n-1)(n-2)}{n^2}\,\varkappa_3, \qquad E(k_4) = \frac{(n-1)(n^2 - 6n + 6)}{n^3}\,\varkappa_4 - \frac{6(n-1)}{n^2}\,\varkappa_2^2.$

27.6. Unbiased estimates. — Consider the sample variance $m_2 = \frac1n\sum_i (x_i - \bar x)^2$. According to 27.3, $m_2$ converges in probability to the population variance $\mu_2$ as n → ∞, and for large values of n we may thus use $m_2$ as an estimate of $\mu_2$. In the terminology introduced by R. A. Fisher (Ref. 89, 96), an estimate which converges in probability to the estimated value, as the size of the sample tends to infinity, is called a consistent estimate. Thus $m_2$ is a consistent estimate of $\mu_2$.

On the other hand, it is shown by (27.4.1) that the mean value of $m_2$ is not $\mu_2$ but $\frac{n-1}{n}\mu_2$. Thus if we repeatedly draw samples of a fixed size n from the given population, and calculate the variance $m_2$ for each sample, the arithmetic mean of all the observed $m_2$-values will converge in probability not to the »true» value $\mu_2$, but to the smaller value $\frac{n-1}{n}\mu_2$.
As an estimate of the quantity $\mu_2$, the variance $m_2$ is thus affected with a certain negative bias, which may be removed if we replace $m_2$ by the quantity

$M_2 = \frac{n}{n-1}\,m_2 = \frac{1}{n-1}\sum_i (x_i - \bar x)^2.$

We have, in fact, $E(M_2) = \frac{n}{n-1}\,E(m_2) = \mu_2$, and accordingly $M_2$ is called an unbiased estimate of $\mu_2$. Since the factor $\frac{n}{n-1}$ tends to unity as n → ∞, both $M_2$ and $m_2$ converge in probability to $\mu_2$, so that $M_2$ is consistent as well as unbiased, while $m_2$ is consistent, but not unbiased.

Similarly, by 27.3, any central moment $m_\nu$ or semi-invariant $k_\nu$ of the sample is a consistent estimate of the corresponding $\mu_\nu$ or $\varkappa_\nu$, but it follows from (27.5.1) and (27.5.7) that for ν > 1 these estimates are not unbiased. As in the case of $m_2$ we may, however, by simple corrections form estimates which are both consistent and unbiased. Thus we obtain for ν = 2, 3 and 4 the following corrected estimates of $\mu_\nu$ and $\varkappa_\nu$:

$M_2 = \frac{n}{n-1}\,m_2, \qquad M_3 = \frac{n^2}{(n-1)(n-2)}\,m_3,$
$M_4 = \frac{n(n^2 - 2n + 3)}{(n-1)(n-2)(n-3)}\,m_4 - \frac{3n(2n-3)}{(n-1)(n-2)(n-3)}\,m_2^2,$

and

$K_2 = \frac{n}{n-1}\,k_2, \qquad K_3 = \frac{n^2}{(n-1)(n-2)}\,k_3,$
$K_4 = \frac{n^2}{(n-1)(n-2)(n-3)}\,\big[(n+1)\,m_4 - 3(n-1)\,m_2^2\big].$

By means of the formulae given in the two preceding paragraphs, it is easily verified that in all these cases we have $E(M_\nu) = \mu_\nu$ and $E(K_\nu) = \varkappa_\nu$. For large values of n, it is often indifferent whether we use $M_\nu$ and $K_\nu$, or $m_\nu$ and $k_\nu$, but for small n the bias involved in the latter quantities may be considerable. — We shall return to questions connected with the properties of estimates in Ch. 32.

We have seen in the preceding paragraphs that the algebraical process of working out formulae for the sampling characteristics of the quantities $m_\nu$ and $k_\nu$ becomes very laborious, as soon as we leave the simplest cases. It has been discovered by R. A. Fisher (Ref. 99), who has introduced the quantities $K_\nu$ (which he denotes by $k_\nu$, cf the footnote in 27.1), that the corresponding calculations for the $K_\nu$ may be considerably simplified by means of combinatorial methods.
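The unbiasedness relations $E(M_\nu) = \mu_\nu$ and $E(K_\nu) = \varkappa_\nu$ are exact identities, and can therefore be confirmed by exhaustive enumeration rather than simulation. The following sketch (again with an arbitrary three-point population of my own choosing) checks $M_2$, $K_3$ and $K_4$ for n = 4:

```python
# Exact expectations of the corrected estimates M2, K3, K4 of 27.6,
# obtained by enumerating all 3^4 samples of size n = 4 drawn from a
# small discrete population (the population is an arbitrary example).
from itertools import product

vals  = [0.0, 1.0, 4.0]
probs = [0.5, 0.3, 0.2]
n = 4

m = sum(p * v for v, p in zip(vals, probs))

def mu(r):
    """Population central moment of order r."""
    return sum(p * (v - m) ** r for v, p in zip(vals, probs))

mu2, mu3, mu4 = mu(2), mu(3), mu(4)
kap3, kap4 = mu3, mu4 - 3 * mu2 ** 2     # population semi-invariants

E_M2 = E_K3 = E_K4 = 0.0
for idx in product(range(3), repeat=n):
    w = 1.0
    for i in idx:
        w *= probs[i]
    xs = [vals[i] for i in idx]
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    m4 = sum((x - xbar) ** 4 for x in xs) / n
    E_M2 += w * n / (n - 1) * m2
    E_K3 += w * n ** 2 / ((n - 1) * (n - 2)) * m3
    E_K4 += w * n ** 2 * ((n + 1) * m4 - 3 * (n - 1) * m2 ** 2) \
              / ((n - 1) * (n - 2) * (n - 3))
```

All three expectations reproduce the population values exactly, for any population with finite fourth moment.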
These combinatorial methods have been further developed by Fisher himself, Wishart and others. A good account of the subject has been given by Kendall (Ref. 19), who gives numerous references to the literature.

27.7. Functions of moments. — It often occurs that the mean and the variance of some function of the sample moments are required. When the function is a polynomial in $\bar x$ and the central moments $m_\nu$, the problem can be solved by the method developed in 27.3—27.5. Even when fractional powers are involved, we may often use a similar direct method. Consider e.g. the simple example of the standard deviation $s = \sqrt{m_2}$ of the sample. We have identically

$\sqrt{m_2} - \sqrt{\mu_2} = \frac{m_2 - \mu_2}{2\sqrt{\mu_2}} - \frac{(m_2 - \mu_2)^2}{2\sqrt{\mu_2}\,\big(\sqrt{m_2} + \sqrt{\mu_2}\big)^2}.$

By (27.4.1), the first term in the second member has a mean value of order $n^{-1}$. The last term is smaller in absolute value than $\frac{(m_2 - \mu_2)^2}{2\mu_2^{3/2}}$, and thus by (27.4.2) and (27.4.1) its mean value is also of order $n^{-1}$. Thus we obtain

(27.7.1) $E(s) = \sqrt{\mu_2} + O\!\left(\frac1n\right).$

By a similar calculation we obtain

(27.7.2) $D^2(s) = \frac{\mu_4 - \mu_2^2}{4\,\mu_2\,n} + O(n^{-2}).$

In many cases, however, we are concerned with functions involving ratios between powers of certain moments, such as the coefficients $g_1$ and $g_2$, the coefficient of correlation etc. We shall give a theorem that covers the most important of these cases. The theorem will be stated and proved for the case of a function $H(m_\nu, m_\varrho)$ of two central moments $m_\nu$ and $m_\varrho$, but is immediately extended to any number of arguments, including also the mean $\bar x$. The case of a function of one single argument is, of course, included as the particular case when the function is independent of one of the two arguments. The theorem also holds, with the same proof, for functions of moments of multi-dimensional samples (cf 27.8).

Consider a function $H(m_\nu, m_\varrho)$ which does not contain n explicitly. We may regard H either as a function of the two arguments $m_\nu$ and $m_\varrho$ or, replacing $m_\nu$ and $m_\varrho$ by their expressions in terms of the sample values, as a function of the n variables $x_1, \ldots, x_n$. In the latter case the function may, of course, contain n explicitly. — We shall now prove the following theorem:

Suppose that the two following conditions are satisfied:

1) In some neighbourhood of the point $m_\nu = \mu_\nu$, $m_\varrho = \mu_\varrho$, the function H is continuous and has continuous derivatives of the first and second order with respect to the arguments $m_\nu$ and $m_\varrho$.
In the latter case the function may, of course, contain n explicitly. — We shall now prove the following theorem: Suppose that the two following conditions are satisfied: 1) In some neighbourhood of the point — = /ip, the function H is continuous and has continuous derivatives of the first and second order with respect to the argmnents and Wq. 23 — 464 H. Cramer 353 27.7 2) For all possible values of the Xi, tve have \B\ < trhere C and p are non-negative constants. Denoting by i/o> the values assumed by the function Z/ (w 4 , m^,) and its first ordei* partial derivatives in the point yvy = , Pq, the mean and the variance of the random variable H[mr, m,) are then given by = o(]f (27.7.3) ' ’ D^(H) = /to(wi!r) H'' 4 - 2 a,, (w/,, m,) 7/, 4- Hi 4 - O By (27.5.4) and (27.5.6), the variance of II is thus of the form r n 4- (n'""/- , where c is constant. — The proofs of these relations found in the literature are often unsatisfactory. The condition 2) as given above may he considerably generalized, but Home condition of this type is necessary for the truth of the theorem. In fact, if we altogether omit condition 2), it would e. g. follow that, for any population with ,M 2 > 0, the function 1/?W2 would have a mean value of the form 1///2 4- This is, how'ever, evidently false. The mean of I/W 2 cannot be finite for any ]jopulation with a distribution of the discrete type, since we have then a positive probability that w /2 ” 0. It is easy to show that similar contradictions may arise even for con- tinuous distributions. Tn 28.4, it will be proved that the function H (m^, is asymptotically nor mally distributed for large values of n. It is interesting to observe that, in this proof, no condition corresponding to the present condition 2) will be reejuired. Let F(S) denote the pr. f. of the joint distribution of ic,, x'g, . . ., Xn- P{S) is a set function in the space JR„ of the Xi. If, in Tchebycheff’s theorem (15.7.1), we take g(^)~{mv — it follows from (27. 5. 
.5) that we have for any f > 0 or where ^ is a constant independent of n and e. The corresponding result holds, of course, for Denote by Z the set of all points in Rh such that the inequalities — pv \ < e and | £ are both satisfied, while Z* is the complementary set. We then have, according* to the above, (27.7.4) M* e** ir 354 27.7 Now HdP + f HdP, y. z* and condition 2) the modulus of the last integral is smaller than 2 A C Choosing /r > ;> + 1 , it follows that (27.7.5) H(IF+ 0 If £ is sufficiently small, we have by condition 1) for an> point in the set Z H (w?,, )n,) -- //q + (w2v — ft,.) 4- //jj — a^) 4 ZV, (27.7.0) J [//n (>y/, --/t,)“ 4 2i/'i2(>??,. — ft,.)(yy/j, — 4 i /22 — where the denote the values of the second order derivatives in some intermediate point between (/t, , ft^) and w^p). Hence f HdP~=ll^P(Z)+ f{m.— fi,)dP + ( 5 ^ 7 . 7 . 7 ) + //. / (w(„ — (i.) dP + j E dP. Consider now the terms in the second member of the last relation. By (27.7.4), the first term differs from Hq by a quantity of order n~^j which is smaller than since ^ > p 4 1 ^ 1 . The two following terms are at most of order since //j and are independent of n, and we have by (27.5.3) and (27.5.5). using the Schwarz inequality (9.5.1), ^ [m, — fir) (IF — E [nir — fl^) — f (//^ — /«r) (IF X z* fly) (I F , 355 27.7 X* X* X* ^ [E{w, - fuY- ■ p (;?*)]* = o C" ) , and similarly for the term containing Finally, by condition l)the derivatives Hjj are bounded for all sufficiently small and it then follows in the same way that the last term in (27.7.7) is also of order Hence the first member of (27.7.7) differs from Hq by a quantity of order and according to (27.7.5) we have thus proved the first relation (27.7.3). In order to prove also the second relation (27.7.3), we write E{H- i/o)- =/■(//- HoY dP+ fiH- H^Y dP. X X* Choosing now 2/; + f, we obtain by means of condition 2) and the first relation (27.7.3) -just proved D» (ff ) == / {H - HoY dP+ 0 (»r •"). 
y We then express {H — i/o)“ by means of the development (27.7.6), and proceed in the same way as before. The calculations are quite similar to those made above, except with respect to the terms of the type J (n?, — where we have, e. g., using (15.4.6) and (27.5.5), X \j H’u{nu < A'E(|w, -^.|*) S ir(£w,.-ju.)V-= ^(n-'). This completes the proof of the theorem. We shall now apply the relations (27.7.3) to some examples. Con- sider first the coefficients of skewness and excess of the sample: 9i m 3. As soon as > 0, these functions satisfy condition 1). In order to show that condition 2) is also satisfied, we write 356 27.7 (Xi — a)* J and hence infer I iV) ,, ^ ”(S (x. - ./)*)■'*”■ li/i I S ^ « V n ^1 j£.' — ./•)’* n • In a similar way it is shown that | ^2 I < ^or all n > 3. — Thus we may apply (27.7.3) to find the means and the variances of and g^. From (27.5.4) and (27.5.6) we find, to the order of approximation g-iven by (27.7.3), yi. =- ^2. fj - ^ ^ ^ ^8 ~ ~ i“4 + /d iM4 + 35 fl l + ae (27.7.8 4/^i|w ^ _ '^ 7 .«« — 4 A<2 Ui /Me — 8 .m| /<3 .Ms + 4 /m;---/M^ ^][+ 16 //g /m| ^<4 + 1 « /wj |M5 When the parent population is normal, these approximate expressions reduce to (•27.7.9) E(ff,) = E(ff,) = 0, The exact expressions for the normal case will be given in (29.3.7). As our next example we consider the ratio X x which is known as the coefficient of rariaiiou of the sample. When the po])ulation distribution is such that the variable takes only post' tire values, we have so that we may apply (27.7.3), replacing, in accordance with the re- mark made in connection with the theorem, niv by x. By (27.2.1), 357 27.7-8 (27.4.2) and (27.4.4) we then obtain, to the order of approximation given by (27.7.3), (27.7.10) E(r)^ a m ' D^V) fA ) — + 4|uj 4 m* ^2 n A normal population does not satisfy the condition that the variable takes only positive values, and it is easily seen that for such a po- pulation V is not bounded, so that condition 2) is not satisfied. 
We may, however, consider a normal distribution truncated at x = 0 (cf 19.3), and when ^ is fairly small, the central moments of such a dis- tribution will be approximately equal to the corresponding moments of a complete normal distribution. In this case, the approximate ex- pression for the variance of V reduces to (27.7.11) = + s'"!)- ^ 2m“w\ 27.8. Characteristics of multi- dimensional distributions. — The formulae for sample characteristics deduced in 27.2 — 27.6, as well as the theorem proved in 27.7, may be directly extended to the character- istics of multi-dimensional samples. The calculations are quite similar to those given above, and we shall here only quote some formulae relating to the two-dimensional case. The definitions of the symbols used below have been given in 27.1, and we assume throughout that all the requisite moments are finite. — We have E (mi k) fiik 0 j » Fn(m*o. + 0 . Fn (»»n , «'«) - + 0 ) • 358 27.8-9 The sample correlation coefficient WljQ WIqJ obviously satisfies the conditions of the theorem of 27.7, since we have I r I ^ 1 . Denoting by q the population value of the correlation coefficient, we then obtain by means of the relations given above, to the order of approximation given by (27.7.3), E{r) = 0, (27.8.1; ^ o' /fUo 2 /Ma t ^ ^22 __ _4jWsi 4 n /I 20 /Mot iWii f^n /“to .^n /Moa/’ For a normal population, the expression for the variance reduces (cf Ex. 3, p. 317) to the following expression, which is correct to the order (27.8.2) D*(>-)= ~ n We finally observe that the theorem of 27.3 on the convergence in probability of sample characteristics holds true without modification in the multi dimensional case. Thus e. g. r converges in probability to Q, while the partial correlation coefficient 7*12 34. .k of the sample converges in probability to P 12 . 34 .. it, etc. 27.9. Corrections for grouping. — In practice samples are very often grouped (cf 25.5). 
Suppose that we draw a sample of n from a one-dimensional distribution of the continuous type, with the fr.f. f(x), and let the sample values be grouped into intervals of length h, with the mid-points $\xi_i = \xi_0 + ih$, where i = 0, ±1, ±2, … In such cases it is usual to assume, in calculating the moments and other sample characteristics, that all sample values belonging to a certain interval fall in the mid-point of that interval. We are then in reality sampling from a distribution of the discrete type, where the variable may take any value $\xi_i$ with the probability

$p_i = \int_{\xi_i - h/2}^{\xi_i + h/2} f(x)\,dx.$

The moments etc. that we are estimating from our sample characteristics according to the formulae previously given in this chapter, are thus the moments of this »grouped distribution»:

$\bar\alpha_\nu = \sum_{i=-\infty}^{\infty} p_i\,\xi_i^\nu.$

However, in many cases it is not these moments that we really want to know, but the moments of the given continuous distribution:

$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu f(x)\,dx.$

Consequently it becomes important to investigate the relations between the two sets of moments. It will be shown that, subject to certain conditions, approximate values of the moments $\alpha_\nu$ may be obtained by applying certain corrections to the raw or grouped moments $\bar\alpha_\nu$.

The raw moments may be written $\bar\alpha_\nu = \sum_i g(\xi_0 + ih)$, where

(27.9.1) $g(\xi) = \xi^\nu \int_{\xi - h/2}^{\xi + h/2} f(x)\,dx.$

From the Euler–MacLaurin sum formula (12.2.5) we then obtain, assuming f(x) continuous for all x,

(27.9.2) $\bar\alpha_\nu = \frac1h\int_{-\infty}^{\infty} g(\xi)\,d\xi + R, \qquad R = -h\int_{-\infty}^{\infty} P_1(y)\,g'(\xi_0 + hy)\,dy,$

where $P_1$ denotes the periodic function $P_1(y) = y - [y] - \tfrac12$. Let us assume for the moment that the remainder R may be neglected. We then obtain, reverting the order of integration,

$\bar\alpha_\nu = \frac1h\int_{-\infty}^{\infty} f(x)\,dx\int_{x - h/2}^{x + h/2}\xi^\nu\,d\xi = \int_{-\infty}^{\infty}\frac{\big(x + \frac h2\big)^{\nu+1} - \big(x - \frac h2\big)^{\nu+1}}{h\,(\nu + 1)}\,f(x)\,dx.$

Thus the grouped moments $\bar\alpha_\nu$ may be expressed as linear functions of the »true» moments $\alpha_\nu$. Solving the equations successively with respect to the $\alpha_\nu$, we obtain

(27.9.3) $\alpha_1 = \bar\alpha_1,$
$\alpha_2 = \bar\alpha_2 - \tfrac{1}{12}\,h^2,$
$\alpha_3 = \bar\alpha_3 - \tfrac14\,\bar\alpha_1\,h^2,$
$\alpha_4 = \bar\alpha_4 - \tfrac12\,\bar\alpha_2\,h^2 + \tfrac{7}{240}\,h^4,$
$\alpha_5 = \bar\alpha_5 - \tfrac56\,\bar\alpha_3\,h^2 + \tfrac{7}{48}\,\bar\alpha_1\,h^4,$
$\alpha_6 = \bar\alpha_6 - \tfrac54\,\bar\alpha_4\,h^2 + \tfrac{7}{16}\,\bar\alpha_2\,h^4 - \tfrac{31}{1344}\,h^6.$

These are the formulae known as Sheppard's corrections (Ref. 212). The general expression is (cf Wold, Ref. 245)

$\alpha_\nu = \sum_{i=0}^{\nu}\binom{\nu}{i}\,\big(2^{1-i} - 1\big)\,B_i\,\bar\alpha_{\nu - i}\,h^i,$

where the $B_i$ are the Bernoulli numbers defined by (12.2.2). If we place the origin in the mean of the distribution, we have $\alpha_1 = \bar\alpha_1 = 0$, and so obtain the corrections for the central moments:

(27.9.4) $\mu_2 = \bar\mu_2 - \tfrac{1}{12}\,h^2, \qquad \mu_3 = \bar\mu_3, \qquad \mu_4 = \bar\mu_4 - \tfrac12\,\bar\mu_2\,h^2 + \tfrac{7}{240}\,h^4.$

These relations hold under the assumption that the remainder R in (27.9.2) may be neglected. Suppose now that we are given two positive integers s and k such that:

1) f(x) and its first 2s derivatives are continuous for all x.

2) The product $x^{k+2} f^{(i)}(x)$ is bounded for all x and for i = 0, 1, …, 2s.

The function g(ξ) given by (27.9.1) will then be continuous for all ξ together with its first 2s + 1 derivatives, and it is easily seen that for ν = 1, 2, …, k and i = 0, 1, …, 2s + 1 we have

(27.9.5) $g^{(i)}(\xi) = O(\xi^{-2})$ as $\xi \to \pm\infty.$

Consequently we may apply the Euler–MacLaurin formula in the form (12.2.6), which expresses the remainder R by means of the higher derivatives of g. It then follows from (12.2.1) and (27.9.5) that

$|R| \le A\,h^{2s},$

where the constant A does not depend on h. Thus if h, the width of the class interval, is sufficiently small, R may be neglected and the corrections (27.9.3) or (27.9.4) applied to moments of any order ν ≤ k, the error involved being of the order $h^{2s}$.

Whenever the frequency curve y = f(x) has a contact of high order with the x-axis at both ends of the range, the above conditions 1) and 2) are satisfied for moderate values of s and k. In such cases, it has been found in practice that the result of applying Sheppard's corrections to the moments is usually good even when h is not very small.
It is, however, always advisable to compare the amount of the correction to be applied to a certain moment with the standard deviation of the sampling distribution of that moment. If, as is often the case, the correction only amounts to a small fraction of the s.d., it does not really matter whether the correction is applied or not. In cases where the frequency curve has not a high order terminal contact, it is usually better not to apply Sheppard's corrections. Other correction formulae have been proposed for use in such cases, but they do not seem to be of sufficiently general validity (cf Elderton, Ref. 12, p. 231).

Langdon and Ore (Ref. 144) and Wold (Ref. 245, 246) have given corrections for the semi-invariants which are valid under the same conditions as Sheppard's. These have the simple form

$\varkappa_2 = \bar\varkappa_2 - \tfrac{1}{12}\,h^2, \qquad \varkappa_\nu = \bar\varkappa_\nu \quad (\nu \ne 2).$

The deduction of Sheppard's corrections may be extended to moments of multi-dimensional samples. In particular we have for a two-dimensional distribution with class intervals of the length $h_1$ for x and $h_2$ for y

(27.9.6) $\mu_{11} = \bar\mu_{11}, \qquad \mu_{21} = \bar\mu_{21}, \qquad \mu_{31} = \bar\mu_{31} - \tfrac14\,\bar\mu_{11}\,h_1^2,$
$\mu_{22} = \bar\mu_{22} - \tfrac{1}{12}\,\bar\mu_{20}\,h_2^2 - \tfrac{1}{12}\,\bar\mu_{02}\,h_1^2 + \tfrac{1}{144}\,h_1^2 h_2^2.$

The corrections for $\mu_{12}$ and $\mu_{13}$ are, of course, obtained by permutation of indices, and the corrections for the marginal moments $\mu_{i0}$ and $\mu_{0j}$ follow directly from (27.9.4), so that by these formulae we are able to find the corrections for all moments of orders not exceeding four.

It should finally be remarked that the problem of corrections for grouping has been treated also from various other points of view. The reader may be referred e.g. to Fisher (Ref. 89) and Kendall (Ref. 136).

CHAPTER 28.

Asymptotic Properties of Sampling Distributions.

28.1. Introductory remarks. — In 27.3 and 27.8, we have seen that all ordinary sample characteristics that are functions of the moments converge in probability to the corresponding population characteristics, as the size n of the sample tends to infinity. In the present chapter, the asymptotic behaviour for large n of the sampling distributions of these and certain other characteristics will be considered somewhat more closely. Following up a remark made in 17.5, we shall first show that, under very general conditions, characteristics based on the sample moments are asymptotically normally distributed for large n. We shall then consider certain other classes of sample characteristics, some of which are, like the moment characteristics, asymptotically normal, while others show a totally different asymptotic behaviour.

28.2. The moments. — Consider n sample values $x_1, \ldots, x_n$ from a one-dimensional distribution. The quantity $n\,a_\nu = \sum_i x_i^\nu$ is a sum of n independent random variables $x_i^\nu$, all having the same distribution, with the mean $E(x_i^\nu) = \alpha_\nu$ and the variance $D^2(x_i^\nu) = \alpha_{2\nu} - \alpha_\nu^2$. We may then apply the Lindeberg–Lévy case of the Central Limit Theorem (cf 17.4) and find that, as n → ∞, the d.f. of the standardized sum

$\frac{\sum_i x_i^\nu - n\,\alpha_\nu}{\sqrt{n\,(\alpha_{2\nu} - \alpha_\nu^2)}} = \frac{a_\nu - \alpha_\nu}{\sqrt{(\alpha_{2\nu} - \alpha_\nu^2)/n}}$

tends to the normal d.f. Φ(x). According to the terminology introduced in 17.4, any sample moment $a_\nu$ is thus asymptotically normal $\big(\alpha_\nu,\ \sqrt{(\alpha_{2\nu} - \alpha_\nu^2)/n}\big)$. We observe that the parameters of the limiting normal distribution are identical with the mean and the s.d. of $a_\nu$ as given by (27.3.1). — In particular, the mean $a_1 = \bar x$ of the sample is asymptotically normal $(m, \sigma/\sqrt n)$, as already pointed out in 17.4.

Similarly, when we consider simultaneously the two random variables $n\,a_\nu = \sum_i x_i^\nu$ and $n\,a_\varrho = \sum_i x_i^\varrho$, an application of the two-dimensional form of the Lindeberg–Lévy theorem (cf 21.11) shows that the joint distribution of the two variables $\sqrt n\,(a_\nu - \alpha_\nu)$ and $\sqrt n\,(a_\varrho - \alpha_\varrho)$ tends to a certain two-dimensional normal distribution.
The argument is evidently general, and by means of the multi-dimensional form of the Lindeberg–Lévy theorem (cf 24.7) we obtain the following result: The joint distribution of any number of the quantities $\sqrt n\,(a_\nu - \alpha_\nu)$ tends to a normal distribution with zero mean values and the second order moments

(28.2.1) $\lambda_{\nu\nu} = E\big(n\,(a_\nu - \alpha_\nu)^2\big) = \alpha_{2\nu} - \alpha_\nu^2, \qquad \lambda_{\nu\varrho} = E\big(n\,(a_\nu - \alpha_\nu)(a_\varrho - \alpha_\varrho)\big) = \alpha_{\nu+\varrho} - \alpha_\nu\alpha_\varrho.$

Thus if we introduce standardized variables $z_\nu$ defined by

(28.2.2) $a_\nu = \alpha_\nu + \frac{\sigma_\nu}{\sqrt n}\,z_\nu, \qquad \sigma_\nu = \sqrt{\alpha_{2\nu} - \alpha_\nu^2},$

every $z_\nu$ will have zero mean and unit s.d., and the joint distribution of the $z_\nu$ will be asymptotically normal, with the covariances

$\frac{\alpha_{\nu+\varrho} - \alpha_\nu\alpha_\varrho}{\sigma_\nu\,\sigma_\varrho}.$

The extension of the above considerations to moments of multi-dimensional samples is immediate.

28.3. The central moments. — By the remarks made in connection with (27.5.2), any central moment may be written in the form

$m_\nu = a_\nu - \nu\,\bar x\,a_{\nu-1} + \frac{w}{n},$

where w is a random variable such that $E(w^2)$ is smaller than a quantity independent of n. According to 27.4, we may without loss of generality assume m = 0, so that $\alpha_\nu = \mu_\nu$. Introducing the standardized variables $z_\nu$ defined by (28.2.2), we then have

(28.3.1) $\sqrt n\,(m_\nu - \mu_\nu) = \sigma_\nu z_\nu - \nu\,\mu_{\nu-1}\,\sigma_1 z_1 + \frac{R}{\sqrt n},$

where $R = w - \nu\,\sigma_1\sigma_{\nu-1}\,z_1 z_{\nu-1}$. Now by (9.5.1)

$E(|R|) \le E(|w|) + \nu\,\sigma_1\sigma_{\nu-1}\,\sqrt{E(z_1^2)\,E(z_{\nu-1}^2)},$

so that E(|R|) is smaller than a quantity independent of n, and it then follows by an application of Tchebycheff's theorem (15.7.1) that $R/\sqrt n$ converges in probability to zero. Applying the theorem 20.6 to the expression (28.3.1) we thus find that the variable $\sqrt n\,(m_\nu - \mu_\nu)$ has, in the limit as n → ∞, the same distribution as the linear expression $\sigma_\nu z_\nu - \nu\,\mu_{\nu-1}\,\sigma_1 z_1$. The joint distribution of $z_\nu$ and $z_1$ is, however, asymptotically normal, and any linear combination of normally distributed variables is, by 24.4, itself normally distributed. Thus any central moment $m_\nu$ of the sample is asymptotically normally distributed, with the mean $\mu_\nu$ and the variance

$\frac{\mu_{2\nu} - \mu_\nu^2 - 2\nu\,\mu_{\nu-1}\mu_{\nu+1} + \nu^2\,\mu_2\,\mu_{\nu-1}^2}{n}.$

We observe that the variance of the limiting normal distribution is identical with the leading term of $D^2(m_\nu)$ as given by (27.5.4). — If we consider simultaneously any number of the $m_\nu$, we find in the same way, using the last theorem of 22.6, that the joint distribution of the $m_\nu$ is asymptotically normal, with the means $\mu_\nu$, and variances and covariances given by the leading terms of (27.5.4) and (27.5.6). — As in the preceding paragraph, the extension to moments of multi-dimensional samples is immediate.

28.4. Functions of moments. — As in 27.7, we shall confine our attention to the case of a function $H(m_\nu, m_\varrho)$ of two central moments from a one-dimensional sample. However, the extension to any number of arguments, to multi-dimensional samples and to the joint distribution of any number of functions is immediate. We shall prove the following theorem.

If, in some neighbourhood of the point $m_\nu = \mu_\nu$, $m_\varrho = \mu_\varrho$, the function $H(m_\nu, m_\varrho)$ is continuous and has continuous derivatives of the first and second order with respect to the arguments $m_\nu$ and $m_\varrho$, the random variable $H(m_\nu, m_\varrho)$ is asymptotically normal, the mean and the variance of the limiting normal distribution being given by the leading terms of (27.7.3).

It will be observed that in this theorem there is nothing corresponding to condition 2) of the theorem of 27.7. Thus we may e.g. assert that the function $1/m_2$ is asymptotically normal, though for certain populations (cf 27.7) neither the mean nor the variance of $1/m_2$ is finite. We remind in this connection of a remark made in 17.4 to the effect that a variable may be asymptotically normal even though its mean and variance do not exist, or do not tend to the mean and variance of the limiting normal distribution.

As in 27.7, we consider the set Z of all points $(x_1, \ldots, x_n)$ such that $|m_\nu - \mu_\nu| < \varepsilon$ and $|m_\varrho - \mu_\varrho| < \varepsilon$. In the present case we shall, however, allow ε to depend on n, and shall in fact choose $\varepsilon = n^{-3/8}$. We then have, using the notations of 27.7 and choosing k = 1,

$P(Z) > 1 - \frac{2A}{n\,\varepsilon^2} = 1 - 2A\,n^{-1/4}.$

If n is sufficiently large, we have for any point of Z the development (27.7.6), which may be written

$H - H_0 = H_1'(m_\nu - \mu_\nu) + H_2'(m_\varrho - \mu_\varrho) + R,$

where $|R| < K\varepsilon^2$. Thus the inequality $|R\sqrt n| < K\varepsilon^2\sqrt n = K n^{-1/4}$ is satisfied with a probability $\ge P(Z) > 1 - 2A\,n^{-1/4}$, so that $R\sqrt n$ converges in probability to zero. By theorem 20.6, we then find that the variables $\sqrt n\,(H - H_0)$ and $H_1'\sqrt n\,(m_\nu - \mu_\nu) + H_2'\sqrt n\,(m_\varrho - \mu_\varrho)$ have, in the limit as n → ∞, the same distribution. By the preceding paragraph, the latter variable is, however, asymptotically normal with the mean and the variance required by our theorem, which is thus proved.

It follows from this theorem that any sample characteristic based on moments is, for large values of n, approximately normally distributed about the corresponding population characteristic, with a variance of the form c/n, provided only that the leading terms of (27.7.3) yield finite values for the mean and the variance of the limiting distribution. This is true for samples in any number of dimensions. Thus e.g. the coefficients of skewness and excess (15.8), the coefficients of regression (21.6 and 23.2), the generalized variance (22.7), and the coefficients of total, partial and multiple correlation (21.7, 23.4 and 23.5) are all asymptotically normally distributed about the corresponding coefficients of the population.

One important remark should, however, be made in this connection. In general, the constant c in the expression of the variance will have a positive value.
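As a closing illustration of the theorem (the simulation set-up is mine), the sample correlation coefficient of a bivariate normal sample is approximately normally distributed about ϱ with variance close to $(1 - \varrho^2)^2/n$, as stated in (27.8.2):

```python
# Simulation sketch: the sample correlation r for bivariate normal data
# with rho = 0.6 is approximately normal about rho with variance near
# (1 - rho^2)^2 / n, cf (27.8.2).
import random

random.seed(5)
rho, n, reps = 0.6, 100, 5000
rs = []
for _ in range(reps):
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + (1 - rho ** 2) ** 0.5 * z2)  # corr = rho
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    rs.append(sxy / (sxx * syy) ** 0.5)

mean_r = sum(rs) / reps
var_r = sum((r - mean_r) ** 2 for r in rs) / reps
# Theory: mean_r near rho = 0.6, var_r near (1 - 0.36)^2 / 100.
```

The small residual discrepancy in the mean is the $O(1/n)$ bias of (27.8.1), which vanishes in the limit.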
of the sample is asymptotically normally distributed, with the mean $\mu_\nu$ and the variance

$$\frac{1}{n}\left(\mu_{2\nu} - \mu_\nu^2 - 2\nu\,\mu_{\nu-1}\mu_{\nu+1} + \nu^2\mu_2\mu_{\nu-1}^2\right).$$

We observe that the variance of the limiting normal distribution is identical with the leading term of $D^2(m_\nu)$ as given by (27.5.4). — If we consider simultaneously any number of the $m_\nu$, we find in the same way, using the last theorem of 22.6, that the joint distribution of the $m_\nu$ is asymptotically normal, with the means $\mu_\nu$, and variances and covariances given by the leading terms of (27.5.4) and (27.5.6). — As in the preceding paragraph, the extension to moments of multi-dimensional samples is immediate.

28.4. Functions of moments. — As in 27.7, we shall confine our attention to the case of a function $H(m_\nu, m_\varrho)$ of two central moments from a one-dimensional sample. However, the extension to any number of arguments, to multi-dimensional samples and to the joint distribution of any number of functions is immediate. We shall prove the following theorem.

If, in some neighbourhood of the point $m_\nu = \mu_\nu$, $m_\varrho = \mu_\varrho$, the function $H(m_\nu, m_\varrho)$ is continuous and has continuous derivatives of the first and second order with respect to the arguments $m_\nu$ and $m_\varrho$, the random variable $H(m_\nu, m_\varrho)$ is asymptotically normal, the mean and the variance of the limiting normal distribution being given by the leading terms of (27.7.3).

It will be observed that in this theorem there is nothing corresponding to condition 2) of the theorem of 27.7. Thus we may e. g. assert that the function $\mu_2/m_2$ is asymptotically normal, though for certain populations (cf 27.7) neither the mean nor the variance of $\mu_2/m_2$ is finite. We remind in this connection of a remark made in 17.4 to the effect that a variable may be asymptotically normal even though its mean and variance do not exist, or do not tend to the mean and variance of the limiting normal distribution.

As in 27.7, we consider the set Z of all points $(x_1, \ldots, x_n)$ such that $|m_\nu - \mu_\nu| < \varepsilon$ and $|m_\varrho - \mu_\varrho| < \varepsilon$.
In the present case we shall, however, allow ε to depend on n, and shall in fact choose $\varepsilon = n^{-2/5}$. We then have, using the notations of 27.7 and choosing k = 1,

$$P(Z) > 1 - \frac{2K}{n\varepsilon^2} = 1 - 2K\,n^{-1/5}.$$

If n is sufficiently large, we have for any point of Z the development (27.7.6), which may be written

$$H = H_0 + \frac{\partial H}{\partial \mu_\nu}(m_\nu - \mu_\nu) + \frac{\partial H}{\partial \mu_\varrho}(m_\varrho - \mu_\varrho) + R,$$

where $H_0 = H(\mu_\nu, \mu_\varrho)$ and $|R| < K\varepsilon^2 = K n^{-4/5}$. Thus the inequality $|R\sqrt{n}| \le K n^{-3/10}$ is satisfied with a probability $\ge P(Z) > 1 - 2K n^{-1/5}$, so that $R\sqrt{n}$ converges in probability to zero. By theorem 20.6, we then find that the variables $\sqrt{n}(H - H_0)$ and

$$\frac{\partial H}{\partial \mu_\nu}\,\sqrt{n}(m_\nu - \mu_\nu) + \frac{\partial H}{\partial \mu_\varrho}\,\sqrt{n}(m_\varrho - \mu_\varrho)$$

have, in the limit as $n \to \infty$, the same distribution. By the preceding paragraph, the latter variable is, however, asymptotically normal with the mean and the variance required by our theorem, which is thus proved.

It follows from this theorem that any sample characteristic based on moments is, for large values of n, approximately normally distributed about the corresponding population characteristic, with a variance of the form c/n, provided only that the leading terms of (27.7.3) yield finite values for the mean and the variance of the limiting distribution. This is true for samples in any number of dimensions. Thus e. g. the coefficients of skewness and excess (15.8), the coefficients of regression (21.6 and 23.2), the generalized variance (22.7), and the coefficients of total, partial and multiple correlation (21.7, 23.4 and 23.5) are all asymptotically normally distributed about the corresponding coefficients of the population.

One important remark should, however, be made in this connection. In general, the constant c in the expression of the variance will have a positive value.
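The theorem may be illustrated numerically. The sketch below is a modern addition, not part of the original text; the normal parent and the values of σ, n and the number of repetitions are arbitrary choices. For $H(m_2) = \sqrt{m_2} = s$ and a normal $(0, \sigma)$ parent, the leading term of (27.7.3) gives the limiting variance $(\mu_4 - \mu_2^2)/(4\mu_2 n) = \sigma^2/(2n)$.

```python
import math
import random

random.seed(3)
# Normal(0, sigma) parent: mu2 = sigma^2, mu4 = 3*sigma^4, so H(m2) = sqrt(m2)
# should be roughly normal with mean sigma and variance sigma^2/(2n).
sigma, n, reps = 2.0, 500, 3000
ss = []
for _ in range(reps):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    xb = sum(xs) / n
    ss.append(math.sqrt(sum((x - xb) ** 2 for x in xs) / n))
mean_s = sum(ss) / reps
sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in ss) / reps)
print(mean_s, sd_s * math.sqrt(2 * n))  # both near sigma
```

The rescaled s. d. `sd_s * sqrt(2n)` should approach σ, in agreement with the limiting variance $\sigma^2/(2n)$.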
However, in exceptional cases c may be zero, which implies that the variance is of a smaller order than $n^{-1}$. Looking back on the proof of the theorem, it is readily seen that in such a case the proof shows that the variable $\sqrt{n}(H - H_0)$ converges in probability to zero, which may be expressed by saying that H is asymptotically normal with zero variance, as far as terms of order $n^{-1}$ are concerned. It may, however, then occur that some expression of the form $n^{\varrho}(H - H_0)$ with $\varrho > \frac{1}{2}$ has a definite limiting distribution, but this is not necessarily normal. We shall encounter an example of this phenomenon in 29.12, in connection with the distribution of the multiple correlation coefficient in the particular case when the corresponding population value is zero.

28.5. The quantiles. — Consider a sample of n values from a one-dimensional distribution of the continuous type, with the d. f. F(x) and the fr. f. f(x) = F′(x). Let $\zeta = \zeta_p$ denote the quantile (cf 15.6) of order p of the distribution, i. e. the root (assumed unique) of the equation $F(\zeta) = p$, where 0 < p < 1. We shall suppose that, in some neighbourhood of $x = \zeta_p$, the fr. f. f(x) is continuous and has a continuous derivative f′(x).

We further denote by $z_p$ the corresponding quantile of the sample. If np is not an integer, and if we arrange the sample values in ascending order of magnitude, there is a unique quantile $z_p$ equal to the sample value $x_{\mu+1}$, where $\mu = [np]$ denotes the greatest integer $\le np$. If np is an integer, we are in the indeterminate case (cf 15.5—15.6), and $z_p$ may be any value in the interval $(x_{np}, x_{np+1})$. In order to avoid trivial complications, we assume in the sequel that np is not an integer. Let g(x) denote the fr. f. of the random variable $z = z_p$.
The probability g(x)dx that z is situated in an infinitesimal interval (x, x + dx) is identical with the probability that, among the n sample values, $\mu = [np]$ are < x, and $n - \mu - 1$ are > x + dx, while the remaining value falls between x and x + dx. Hence

$$g(x)\,dx = \binom{n}{\mu}(n - \mu)\,(F(x))^{\mu}\,(1 - F(x))^{n-\mu-1}\,f(x)\,dx.$$

In order to study the behaviour of the distribution of z for large n, we consider the random variable

$$y = \sqrt{\frac{n}{pq}}\;f(\zeta)\,(z - \zeta),$$

where q = 1 − p. By (15.1.2), y has a fr. f. which may be written as a product of three factors $A_1 A_2 A_3$ derived from the above expression of g(x). Using Stirling's formula (cf 16.4.8) together with the development

$$F(t) = p + \frac{x\sqrt{pq}}{\sqrt{n}} + O\left(\frac{1}{n}\right), \qquad t = \zeta + \frac{x\sqrt{pq}}{f(\zeta)\sqrt{n}},$$

we find after some calculation that the fr. f. of y tends, for any fixed x, to the normal fr. f. $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. It is also seen that $A_1$, $A_2$ and $A_3$ are uniformly bounded in any interval a < x < b, so that by (5.3.6) the probability of the inequality a < y < b tends to the limit $\frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\,dx$.

It follows that the sample quantile $z_p$ is asymptotically normal $\left(\zeta, \dfrac{\sqrt{pq}}{f(\zeta)\sqrt{n}}\right)$, where $\zeta = \zeta_p$ is the corresponding quantile of the population. — In particular, the median of the sample is asymptotically normal $\left(\zeta, \dfrac{1}{2 f(\zeta)\sqrt{n}}\right)$, where $\zeta = \zeta_{1/2}$ is the median of the population.

For a normal distribution, with the parameters m and σ, the median is m, and we have $f(m) = \frac{1}{\sigma\sqrt{2\pi}}$. Thus the median z of a sample of n from this distribution is asymptotically normal $\left(m, \frac{\sigma}{\sqrt{n}}\sqrt{\frac{\pi}{2}}\right)$. On the other hand, we know that the mean $\bar{x}$ of such a sample is exactly normal $\left(m, \frac{\sigma}{\sqrt{n}}\right)$. As $n \to \infty$, z and $\bar{x}$ both converge in probability to m, and for large values of n we may use either z or $\bar{x}$ as an estimate of m. The latter estimate should, however, be considered as having the greater precision, since the s. d. $\frac{\sigma}{\sqrt{n}}$ corresponding to $\bar{x}$ is smaller than the s. d. $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{\pi}{2}} \approx 1.2533\,\frac{\sigma}{\sqrt{n}}$ corresponding to z. — A systematic comparison of the precision of various estimates of a population characteristic will be given in the theory of estimation (cf. Ch. 32).
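The comparison between the median and the mean may be verified by simulation. The sketch below is a modern addition, not part of the original text; the parameter values are arbitrary. For a normal (0, 1) parent, the s. d. of the sample median multiplied by $\sqrt{n}$ should approach $\sqrt{\pi/2} \approx 1.2533$.

```python
import math
import random
import statistics

random.seed(4)
# Sample median of a normal(m, sigma) parent: asymptotically normal with
# s.d. sigma*sqrt(pi/2)/sqrt(n), about 1.2533 times the s.d. of the mean.
m, sigma, n, reps = 0.0, 1.0, 501, 2000
meds = []
for _ in range(reps):
    xs = [random.gauss(m, sigma) for _ in range(n)]
    meds.append(statistics.median(xs))
mu = sum(meds) / reps
sd_med = math.sqrt(sum((v - mu) ** 2 for v in meds) / reps)
print(sd_med * math.sqrt(n))  # near sqrt(pi/2)
```

An odd sample size is used so that the median is a single order statistic, avoiding the indeterminate case mentioned above.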
Consider now the joint distribution of two quantiles z′ and z″, of orders $p_1$ and $p_2$, where $p_1 < p_2$. By a calculation of the same kind as above, it can be shown that this distribution is asymptotically normal. The means of the limiting normal distribution are the corresponding quantiles $\zeta_1$ and $\zeta_2$ of the population, while the asymptotic expressions of the second order moments are

$$D^2(z') = \frac{p_1 q_1}{n f^2(\zeta_1)}, \qquad \operatorname{Cov}(z', z'') = \frac{p_1 q_2}{n f(\zeta_1) f(\zeta_2)}, \qquad D^2(z'') = \frac{p_2 q_2}{n f^2(\zeta_2)}.$$

Choosing in particular $p_1 = \frac{1}{4}$ and $p_2 = \frac{3}{4}$, $\zeta_1$ and $\zeta_2$ are the lower and upper quartiles of the population, and we find that the semi-interquartile range (cf 15.6) of the sample, $\frac{1}{2}(z'' - z')$, is asymptotically distributed in a normal distribution with the mean $\frac{1}{2}(\zeta_2 - \zeta_1)$ and the s. d.

$$\frac{1}{8\sqrt{n}}\sqrt{\frac{3}{f^2(\zeta_1)} - \frac{2}{f(\zeta_1) f(\zeta_2)} + \frac{3}{f^2(\zeta_2)}}.$$

For a normal (m, σ) population, the mean of the semi-interquartile range becomes 0.6745 σ, and the s. d. $0.7867\,\frac{\sigma}{\sqrt{n}}$.

28.6. The extreme values and the range. — So far, we have only considered sample characteristics which, in large samples, tend to be normally distributed. We now turn to a group of characteristics showing a totally different behaviour. In a one-dimensional sample of n values, there are always two finite and uniquely determined extreme values¹), and also a finite range, which is the difference between the extremes. More generally, we may arrange the n sample values in order of magnitude, and consider the ν:th value from the top or from the bottom. For ν = 1 we obtain, of course, the extreme values. It is often important to know the sampling distributions of the extreme values, the ν:th values, the range, and other similar characteristics of the sample. We shall now consider some properties of these distributions. We restrict ourselves to the case when the population has a distribution of the continuous type, with the d. f. F and the fr. f. f = F′. Let x denote the ν:th value from the top in a sample of n from this population.
The probability element $g_\nu(x)dx$ in the sampling distribution of x is identical with the probability that, among the n sample values, $n - \nu$ are < x, and $\nu - 1$ are > x + dx, while the remaining value falls between x and x + dx. Hence

(28.6.1) $g_\nu(x)\,dx = \nu\dbinom{n}{\nu}\,(F(x))^{n-\nu}\,(1 - F(x))^{\nu-1}\,f(x)\,dx.$

¹) If, e. g., the two uppermost values are equal, any of them will be considered as the upper extreme value, and similarly in other cases.

If we introduce a new variable ξ by the substitution

(28.6.2) $\xi = n\left(1 - F(x)\right),$

we shall have $0 \le \xi \le n$, and the fr. f. $h_\nu(\xi)$ of the new variable will be

(28.6.3) $h_\nu(\xi) = \dfrac{\nu}{n}\dbinom{n}{\nu}\left(\dfrac{\xi}{n}\right)^{\nu-1}\left(1 - \dfrac{\xi}{n}\right)^{n-\nu}$

for $0 \le \xi \le n$, and $h_\nu(\xi) = 0$ outside (0, n). As $n \to \infty$, $h_\nu(\xi)$ converges for any $\xi \ge 0$ to the limit

(28.6.4) $\dfrac{\xi^{\nu-1}}{\Gamma(\nu)}\,e^{-\xi}.$

Further, $h_\nu(\xi)$ is uniformly bounded for all n in every finite ξ-interval, and thus by (5.3.6) ξ is, in the limit as $n \to \infty$, distributed according to the fr. f. (28.6.4), which is a particular case of (12.3.3). Similarly, if y denotes the ν:th value from the bottom in our sample, and if we introduce a new variable η by the substitution

(28.6.5) $\eta = n F(y),$

we find that η has the fr. f. $h_\nu(\eta)$ and thus, in the limit, the fr. f. $\frac{\eta^{\nu-1}}{\Gamma(\nu)}e^{-\eta}$.

We may also consider the joint distribution of the ν:th value x from the top and the ν:th value y from the bottom. Introducing the variables ξ and η by the substitutions (28.6.2) and (28.6.5), it is then proved in the same way as above that the joint fr. f. of ξ and η is

(28.6.6) $\dfrac{n!}{(n-2\nu)!\,n^{2\nu}}\cdot\dfrac{(\xi\eta)^{\nu-1}}{\Gamma^2(\nu)}\left(1 - \dfrac{\xi + \eta}{n}\right)^{n-2\nu},$

where ξ > 0, η > 0, $\xi + \eta \le n$, and 2ν < n. As $n \to \infty$, this tends to

(28.6.7) $\dfrac{(\xi\eta)^{\nu-1}}{\Gamma^2(\nu)}\,e^{-\xi-\eta},$

so that ξ and η are, in the limit, independent.

When the d. f. F is given, it is sometimes possible to solve the equations (28.6.2) and (28.6.5) explicitly with respect to x and y. We then obtain the ν:th values x and y expressed in terms of the auxiliary variables ξ and η of known distributions. When an explicit solution cannot be given, it is often possible to obtain an asymptotic solution for large values of n.
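The limit (28.6.4) may be checked by simulation. The sketch below is a modern addition, not part of the original text; the exponential parent and the values of ν, n and the number of repetitions are arbitrary. For an Exp(1) parent, $1 - F(x) = e^{-x}$, so ξ is easy to compute from the ν:th largest value, and in the limit it has the fr. f. $\xi^{\nu-1}e^{-\xi}/\Gamma(\nu)$, with mean and variance both equal to ν.

```python
import math
import random

random.seed(5)
# xi = n*(1 - F(x)) for the nu-th largest x; limiting law is Gamma(nu),
# so both the mean and the variance of xi should be near nu.
nu, n, reps = 2, 1000, 3000
xis = []
for _ in range(reps):
    xs = sorted(random.expovariate(1.0) for _ in range(n))
    xis.append(n * math.exp(-xs[-nu]))  # 1 - F(x) = exp(-x) for Exp(1)
mean_xi = sum(xis) / reps
var_xi = sum((v - mean_xi) ** 2 for v in xis) / reps
print(mean_xi, var_xi)  # each near nu = 2
```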
In such cases, the known distributions of ξ and η may be used to find the limiting forms of the distributions of the ν:th values, the range etc. We now proceed to consider some examples of this method, omitting certain details of calculation.

1. The rectangular distribution. — Let the sampled variable be uniformly distributed (cf 19.1) over the interval (a, b). If, in a sample of n from this distribution, x and y are the ν:th values from the top and from the bottom, (28.6.2) and (28.6.5) give

$$x = b - \frac{b-a}{n}\,\xi, \qquad y = a + \frac{b-a}{n}\,\eta,$$

where ξ and η have the joint fr. f. (28.6.6), with the limiting form (28.6.7). Hence we obtain

$$E(x) = b - \frac{\nu}{n+1}\,(b-a),$$

and similar expressions for y. We further have

(28.6.8) $E\left(\dfrac{x+y}{2}\right) = \dfrac{a+b}{2},$

which shows that the arithmetic mean of the ν:th values x and y provides a consistent and unbiased estimate (cf 27.6) of the mean (a + b)/2 of the distribution. Finally, we have for the difference x − y

(28.6.9) $E(x - y) = \left(1 - \dfrac{2\nu}{n+1}\right)(b-a).$

For ν = 1 the difference x − y is, of course, the range of the sample.

2. The triangular distribution. — In the case of a triangular distribution (cf 19.1) over the range (a, b), the equations (28.6.2) and (28.6.5) give, when $x > \frac{a+b}{2}$ and $y < \frac{a+b}{2}$,

$$x = b - (b-a)\sqrt{\frac{\xi}{2n}}, \qquad y = a + (b-a)\sqrt{\frac{\eta}{2n}}.$$

We consider only the particular case ν = 1, when x and y are the extreme values of the sample, and then obtain

(28.6.10) $D^2(x - y) = \left(1 - \dfrac{\pi}{4}\right)\dfrac{(b-a)^2}{n} + o\left(\dfrac{1}{n}\right).$

3. Cauchy's distribution. — For the distribution given by the fr. f. (19.2.1), the substitution (28.6.2) gives

$$\xi = \frac{n}{\pi}\,\operatorname{arc\,cot}\frac{x - \mu}{\lambda},$$

or

$$x = \mu + \lambda\cot\frac{\pi\xi}{n} = \mu + \frac{\lambda n}{\pi\xi} + \varrho,$$

where ξ has the limiting distribution (28.6.4). The remainder ϱ converges in probability to zero, and it then follows from 20.6 that the ν:th value x from the top is, in the limit, distributed as $\mu + \frac{\lambda n}{\pi v}$, where $v = \xi$ has the fr. f. $\frac{v^{\nu-1}}{\Gamma(\nu)}e^{-v}$. Similarly the ν:th value from the bottom, y, is distributed as $\mu - \frac{\lambda n}{\pi w}$, where w is, in the limit, independent of v and has a distribution of the same form.
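The Cauchy result may be illustrated numerically. The sketch below is a modern addition, not part of the original text; the values of n and the number of repetitions are arbitrary. For a standard Cauchy parent (μ = 0, λ = 1) and ν = 1, the maximum x of the sample satisfies $x \approx n/(\pi\xi)$, so that $n/(\pi x)$ should behave like an Exp(1) variable, with mean and variance both near 1.

```python
import math
import random

random.seed(6)
# Standard Cauchy variates via the inverse-c.d.f. tan(pi*(U - 1/2)).
# n/(pi * max) should be approximately Exp(1): mean 1, variance 1.
n, reps = 1000, 3000
vals = []
for _ in range(reps):
    xmax = max(math.tan(math.pi * (random.random() - 0.5)) for _ in range(n))
    vals.append(n / (math.pi * xmax))
mean_v = sum(vals) / reps
var_v = sum((v - mean_v) ** 2 for v in vals) / reps
print(mean_v, var_v)  # each near 1
```

This also makes the inconsistency of the midrange-type estimates plain: the maximum itself is of order n, not of order 1.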
In the case ν = 1, the mean values of x and y are not finite. For ν > 2 we have

(28.6.11) $E\left(\dfrac{x+y}{2}\right) = \mu, \qquad D^2\left(\dfrac{x+y}{2}\right) = \dfrac{\lambda^2 n^2}{2\pi^2(\nu-1)^2(\nu-2)} + O(n).$

We observe that the variance does not tend to zero as $n \to \infty$. Accordingly $\frac{x+y}{2}$ does not converge in probability to μ, so that $\frac{x+y}{2}$ is not a consistent estimate (cf 27.6) of μ.

4. Laplace's distribution. — For the fr. f. (19.2.4) we obtain for the ν:th value x from the top, when x > μ,

$$x = \mu + \lambda\log\frac{n}{2} - \lambda\log\xi,$$

where ξ has the limiting distribution (28.6.4). Substituting v for $-\log\xi$, we thus have

$$x = \mu + \lambda\log\frac{n}{2} + \lambda v,$$

where $v = -\log\xi$ has, in the limit, the fr. f.

$$\varkappa_\nu(v) = \frac{1}{\Gamma(\nu)}\,e^{-\nu v - e^{-v}}.$$

Similarly, the ν:th value from the bottom is

$$y = \mu - \lambda\log\frac{n}{2} - \lambda w,$$

where w is, in the limit, independent of v and has the fr. f. $\varkappa_\nu(w)$. In the particular case ν = 1 we have (cf the following example)

(28.6.12) $E\left(\dfrac{x+y}{2}\right) = \mu, \qquad D^2\left(\dfrac{x+y}{2}\right) = \dfrac{\pi^2\lambda^2}{12},$

and we observe that, as in the preceding case, $\frac{x+y}{2}$ is not a consistent estimate of μ.

5. The normal distribution. — Consider first a normal distribution with the standardized parameters m = 0 and σ = 1. If x is the ν:th value from the top in a sample of n from this distribution, (28.6.2) gives

$$\xi = \frac{n}{\sqrt{2\pi}}\int_x^{\infty} e^{-t^2/2}\,dt.$$

It is required to find an asymptotic solution of this equation with respect to x when n is large. By partial integration, the equation may be put in the form

$$\frac{\xi}{n} = \frac{e^{-x^2/2}}{x\sqrt{2\pi}}\left(1 + O\left(x^{-2}\right)\right).$$

Assuming $-\log\xi$ bounded, we obtain after some calculation

$$x = \sqrt{2\log n} - \frac{\log\log n + \log 4\pi}{2\sqrt{2\log n}} - \frac{\log\xi}{\sqrt{2\log n}} + o\left(\frac{1}{\sqrt{\log n}}\right),$$

and it follows that the remainder converges in probability to zero. Proceeding to the general case of a normal distribution with arbitrary parameters m and σ, we need only replace x by $\frac{x-m}{\sigma}$. Substituting at the same time v for $-\log\xi$, we thus find that the ν:th value x from the top has the expression

(28.6.13) $x = m + \sigma\sqrt{2\log n} - \sigma\,\dfrac{\log\log n + \log 4\pi}{2\sqrt{2\log n}} + \sigma\,\dfrac{v}{\sqrt{2\log n}},$

where $v = -\log\xi$ is a variable which, in the limit as $n \to \infty$, has the fr. f.
(28.6.14) $\varkappa_\nu(v) = \dfrac{1}{\Gamma(\nu)}\,e^{-\nu v - e^{-v}},$

already encountered in the preceding example. Similarly we have, for the ν:th value y from the bottom, the expression

(28.6.15) $y = m - \sigma\sqrt{2\log n} + \sigma\,\dfrac{\log\log n + \log 4\pi}{2\sqrt{2\log n}} - \sigma\,\dfrac{w}{\sqrt{2\log n}},$

where w is, in the limit, independent of v and has the fr. f. $\varkappa_\nu(w)$. Thus for large values of n the ν:th values x and y are related by simple linear transformations to variables having the limiting distribution defined by the fr. f. (28.6.14). The frequency curves $u = \varkappa_\nu(v)$ are shown for some values of ν in Fig. 27.

We observe that the limiting distribution has, except for different normalization, the same form as in the preceding example. A straightforward generalization of the above argument shows that the same limiting fr. f. $\varkappa_\nu(v)$ appears in all cases where the fr. f. of the parent distribution is, for large values of |x|, asymptotically expressed by $A\,e^{-B|x|^p}$, where A, B and p are positive constants.

The mode of a variable which has the fr. f. $\varkappa_\nu(v)$ is $-\log\nu$, while the mean and the variance are given by the relations

$$E(v) = \int_{-\infty}^{\infty} v\,\varkappa_\nu(v)\,dv = C - S_1,$$
$$D^2(v) = \int_{-\infty}^{\infty} v^2\,\varkappa_\nu(v)\,dv - (C - S_1)^2 = \frac{\pi^2}{6} - S_2,$$

obtained by means of (12.5.6) and (12.5.7). Here C denotes Euler's constant defined by (12.2.7), while

$$S_1 = 1 + \frac{1}{2} + \cdots + \frac{1}{\nu-1}, \qquad S_2 = 1 + \frac{1}{2^2} + \cdots + \frac{1}{(\nu-1)^2}.$$

Hence we obtain for the ν:th value x from the top:

(28.6.16)
$$E(x) = m + \sigma\sqrt{2\log n} - \sigma\,\frac{\log\log n + \log 4\pi + 2(S_1 - C)}{2\sqrt{2\log n}} + O\left(\frac{1}{(\log n)^{3/2}}\right),$$
$$D^2(x) = \frac{\sigma^2}{2\log n}\left(\frac{\pi^2}{6} - S_2\right) + o\left(\frac{1}{\log^2 n}\right),$$

and similar expressions for the ν:th value y from the bottom. We further obtain

(28.6.17) $E\left(\dfrac{x+y}{2}\right) = m, \qquad D^2\left(\dfrac{x+y}{2}\right) = \dfrac{\sigma^2}{4\log n}\left(\dfrac{\pi^2}{6} - S_2\right) + o\left(\dfrac{1}{\log^2 n}\right),$

so that in this case $\frac{x+y}{2}$ gives a consistent estimate for m, though the variance only tends to zero as $(\log n)^{-1}$, which is not nearly so rapidly as $n^{-1}$. — For the difference x − y between the ν:th values we have

(28.6.18) $E(x - y) = 2\sigma\sqrt{2\log n}\left(1 + O\left(\dfrac{\log\log n}{\log n}\right)\right), \qquad D^2(x - y) = \dfrac{\sigma^2}{\log n}\left(\dfrac{\pi^2}{6} - S_2\right) + o\left(\dfrac{1}{\log^2 n}\right).$
We may thus obtain a consistent estimate for σ by multiplying x − y with an appropriate constant, and the variance of this estimate will, for a given large value of n, be approximately proportional to $\frac{\pi^2}{6} - S_2$.

The limiting forms discussed above in connection with the normal distribution and Laplace's distribution are due partly to R. A. Fisher and Tippett (Ref. 110), and partly to Gumbel (Ref. 120), in whose papers further information concerning the properties of these distributions and their statistical applications will be found.

In the limiting expressions for the case of the normal distribution, the remainder terms are of the same order as a negative power of log n. Now log n tends to infinity less rapidly than any power of n, and accordingly it has been found that the approach to the limiting forms is here considerably slower than e. g. in the case of the approach to normality of the distribution of some moment characteristic. The exact distributions of the extreme values and the range of a sample from a normal distribution have been investigated by various authors, and certain tables are available. The reader is referred to K. Pearson's tables, and to papers by Irwin, Tippett, E. S. Pearson and Davies, E. S. Pearson and Hartley (Ref. 264, 131, 226, 196, 197). We give in Fig. 28 some comparisons between the exact distribution of the largest member of a sample and the corresponding distributions calculated from the limiting expressions (28.6.13)—(28.6.14).

Fig. 28. Distribution function for the upper extreme of a sample of n values from a normal population with m = 0 and σ = 1. Exact: ———. Approximate formula: – – –.

CHAPTER 29.

Exact Sampling Distributions.

29.1. The problem. — In the two preceding chapters, we have shown how to calculate moments and various other characteristics of sampling distributions, and we have investigated the asymptotic behaviour of the distributions for samples of infinitely increasing size.
However, it is clear that a knowledge of the exact form of a sampling distribution would be of a far greater value than the knowledge of a number of moment characteristics and of a limiting expression for large values of n. Especially when we are dealing with small samples, as is often the case in the applications, the asymptotic expressions are sometimes grossly inadequate, and a knowledge of the exact form of the distribution would then be highly desirable.

Suppose that we are concerned with a sample of n observed values from a one-dimensional distribution with the d. f. F(x), and that we wish to find the sampling distribution of some sample characteristic $g(x_1, \ldots, x_n)$. The problem is then to find the distribution of a given function $g(x_1, \ldots, x_n)$ of n independent random variables $x_1, \ldots, x_n$, each of which has the same distribution with the d. f. F(x). Theoretically, this problem has been solved in 14.5, where we have shown that there is always a unique solution, as soon as the functions F and g are given. Numerically, the problem may often be solved by means of the computation of tables based on approximate formulae. If, however, we require a solution that can be explicitly expressed in terms of known functions, the situation will be quite different. At the present state of our knowledge such a solution can, in fact, only be reached in a comparatively small number of cases.

One case where a result of a certain generality can be given is the simple case of the mean $\bar{x}$ of a one-dimensional sample. In Chs 16—19 we have seen (cf 16.2, 16.5, 17.3, 18.1, 19.2) that many distributions possess what we have called an addition theorem, i. e. a theorem that gives an explicit expression for the d. f. $G_n(x)$ of the sum $x_1 + \cdots + x_n$, where the $x_i$ are independent, each having the given d. f. F(x). The d. f.
of the mean $\bar{x}$ is then $G_n(nx)$, and thus we can find the exact sampling distribution of the mean, whenever the parent distribution possesses an addition theorem. — We shall give some examples:

When the parent F(x) is normal (m, σ), we have seen in 17.3 that the mean is normal $(m, \sigma/\sqrt{n})$.

When F(x) corresponds to a Cauchy distribution, we have seen in 19.2 that $\bar{x}$ has the same d. f. F(x) as the parent population.

When the parent has a Poisson distribution with the parameter λ, the mean $\bar{x}$ has the possible values $0, \frac{1}{n}, \frac{2}{n}, \ldots$, and it follows from (16.5.4) that we have

$$P\left(\bar{x} = \frac{\nu}{n}\right) = \frac{(n\lambda)^\nu}{\nu!}\,e^{-n\lambda}.$$

Apart from the case of the mean (with respect to this case, cf Irwin, Ref. 132), very few results of a general character are known about the exact form of sampling distributions. Only in one particular case, viz. the case of sampling from a normal parent distribution (in any number of dimensions), has it so far been possible to investigate the subject systematically and reach results of a certain completeness. In the present chapter, we shall be concerned with this case.

Some isolated results belonging to this order of ideas were discovered at an early stage by Helmert, K. Pearson and Student. The first systematic investigations of the subject were, however, made by R. A. Fisher, who gave rigorous proofs of the earlier results and discovered the exact forms of the distributions in fundamentally important new cases. In his work on these problems, Fisher generally uses methods of analytical geometry in a multi-dimensional space. Other methods, involving the use of characteristic functions, or of certain transformations of variables etc., have later been applied to this type of problems. In the sequel, we shall give examples of the use of various methods.

29.2. Fisher's lemma. Degrees of freedom. — In the study of sampling distributions connected with normally distributed variables, the following transformation due to R. A. Fisher (Ref. 97) is often useful. Suppose that $x_1, \ldots,$
$x_n$ are independent random variables, each of which is normal (0, σ). Consider an orthogonal transformation (cf 11.9)

(29.2.1) $y_i = c_{i1}x_1 + c_{i2}x_2 + \cdots + c_{in}x_n \qquad (i = 1, 2, \ldots, n),$

replacing the variables $x_1, \ldots, x_n$ by new variables $y_1, \ldots, y_n$. By 24.4, the joint distribution of the $y_i$ is normal, and we obtain (cf Ex. 16, p. 319) $E(y_i) = 0$, and

$$E(y_i y_k) = \sigma^2\sum_{j=1}^n c_{ij}c_{kj} = \begin{cases} \sigma^2 & \text{for } i = k, \\ 0 & \text{for } i \ne k, \end{cases}$$

so that the new variables $y_i$ are uncorrelated. It then follows from 24.1 that they are even independent. Thus the transformed variables $y_i$ are independent and normal (0, σ).

The geometrical signification of this result is evident. The transformation (29.2.1) corresponds (cf 11.9) to a rotation of the system of coordinates about the origin, and our result shows that the particular normal distribution in $R_n$ considered here is invariant under this rotation.

Suppose now that, at first, only a certain number p < n of linear functions $y_1, y_2, \ldots, y_p$ are given, where $y_i = c_{i1}x_1 + \cdots + c_{in}x_n$, and the $c_{ij}$ satisfy the orthogonality conditions

$$\sum_{j=1}^n c_{ij}c_{kj} = \begin{cases} 1 & \text{for } i = k, \\ 0 & \text{for } i \ne k, \end{cases}$$

for $i = 1, 2, \ldots, p$ and $k = 1, 2, \ldots, p$. By 11.9 we can then always find n − p further rows $c_{i1}, \ldots, c_{in}$, where $i = p+1, \ldots, n$, such that the complete matrix $C = \{c_{ik}\}$ is orthogonal. — Consider the quadratic form in $x_1, \ldots, x_n$

(29.2.2) $Q(x_1, \ldots, x_n) = \sum_1^n x_i^2 - y_1^2 - \cdots - y_p^2.$

If we apply here the orthogonal transformation (29.2.1), $\sum_1^n x_i^2$ is by 11.9 transformed into $\sum_1^n y_i^2$, and we obtain

$$Q = y_{p+1}^2 + \cdots + y_n^2.$$
the smallest number of independent variables on which the form may be brought by a non-singular linear transformation. In statistical applica- tions, this number of free variables entering into a problem is usually, in accordance with the terminology introduced by R. A. Fisher, denoted as the number of degrees of freedom (abbreviated d. of fr.) of the problem, or of the distribution of the random variables attached to the problem. n Thus e. g. the variable = 2 considered in 1 18.1 are said to possess n degrees of freedom, since the quadratic form is of rank n. The corresponding distribution will accordingly be called the with n degrees of freedom. n Similarly the form Q=^^x] —y\ yl of rank n—p con- 1 sidered above will be said to possess n — p degrees of freedom, and the result proved above thus implies that the variable Q/a* /.9 distri- buted in d xf distribution tvith n—p degrees of freedom. The same terminology will often be applied also to other distri- butions. In the case of Student’s distribution, it is customary to say that the fr.f. Sn{x) defined by (18.2.4) is attached to Student's distri- bution with n degrees of freedom, since the quadratic form in the de- nominator of the variable t as defined by (18.2.1) has the rank n. For Fisher’s ^-distribution (cf. 18.3), we have to distinguish between the m d. of fr. in the numerator of (18.3.1), and the w d. of fr. in the denominator. 29.3. The Joint distribution of u' and s"^ in samples from a normal distribution. — We have already pointed out in 29.1 that the mean 381 29.3 JJ of a sample of n from a parent distribution which is normal (m, a) is itself normal (wi, ajV n). We now proceed to consider the distribu- tion of the sample variance 6'* = mg — ^ 2 (^* same time, the joint distribution of x and 6®. Without loss of generality, we may then assume that the population mean m is zero, since this does not affect and is equivalent to the addition of a constant to x. 
We thus assume that every Xt is normal (0, a), and consider the identity (cf 11.11.2) (29.3.1) 2 (^1 — ^“)* = 2 • 1 1 Now -f • -f j is the square of a linear form \l^ w V nl c^Xi + • -f CnOCn such that + • • • + Cn 1. We may thus apply the lemma of the preceding paragraph, taking in (29.2.2) p = 1 and = Returning to the case of a general population mean m, we then have the following theorem first rigorously proved by R. A. Fisher (Ref. 97): The mean x and the variance of a normal sample are independent, and X is normal (m, n), tvhile ns^h^ is distributed in a yf distribu- tion with n — I degrees of freedom. It can be 8hown that the independence of j’ and holds only when the parent distribution is normal (cf Geary, Kef. 116, and Lukacs, Kef. 160). On the other hand, we have seen in 27.4 that 7* and s* are uncorrclated whenever the third central mo- ment of the parent distribution is zero. It follows from the theorem that the unbiased estimate (cf 27.6) of the variance, — — - .s®, has the fr. f. ^ ^ ^n-i Com- n — 1 (T* \ a* / paring with the fr. f . of given in the table at the end of 18.1, it is seen that the variable 2 is distributed 1 as the arithmetic mean of n — 1 squares of independent normal (0, a) variables, in accordance with the fact that there are n — 1 d. of fr. in the distribution. 382 29.S The mean and the variance of s* = have already been g^iven in (27.4.5). By means of (18.1,5) we obtain the following general ex- pression of the moments (29.3.2) E(»,;) - Hence we deduce the expressions for the coefficients of skewness and excess : I \ 12 yj (wj) = , Yt (wjg) = - — - ■ For the s. d. s==Vm^ of the sample we obtain from the theorem^ using Stirling’s formula (12.5.3) (29.3.3) in accordance with the general expressions (27.7.1) and (27.7.2). 
In view of the great importance of the theorem on the joint distribution of x̄ and s², we shall now give another proof of the same result, using certain transformations of variables, combined with geometrical arguments. As before, we suppose in the proof that m = 0.

Consider the n-dimensional sample space $R_n$ of the variables $x_1, \ldots, x_n$. Our sample is represented by a variable point in this space, the sample point $X = X(x_1, \ldots, x_n)$. Let XR be the perpendicular from X to the line $x_1 = x_2 = \cdots = x_n$. Then R has the coordinates $(\bar{x}, \ldots, \bar{x})$, so that the square of the distance OR from the origin O to R is $n\bar{x}^2$, and consequently

$$\overline{XR}^2 = \overline{OX}^2 - \overline{OR}^2 = \sum_1^n x_i^2 - n\bar{x}^2 = ns^2.$$

The joint distribution of the variables $x_i$ is conceived in the usual way as a distribution of a mass unit over $R_n$, and the probability element of this distribution is

$$dP = \frac{1}{(2\pi)^{n/2}\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum x_i^2}\,dx_1 \cdots dx_n.$$

We now perform a rotation of the coordinate axes, such that one of the axes is brought to coincide with the line OR. This rotation is expressed by an orthogonal substitution $y_i = \sum_j c_{ij}x_j$, where one of the $y_i$, say $y_n$, is equal to $\sqrt{n}\,\bar{x} = \frac{1}{\sqrt{n}}x_1 + \cdots + \frac{1}{\sqrt{n}}x_n$. We then obtain

$$\sum_1^n x_i^2 = \sum_1^n y_i^2 = n\bar{x}^2 + \sum_1^{n-1} y_j^2,$$

and hence $\sum_1^{n-1} y_j^2 = ns^2$. The determinant of the substitution being ± 1, we have by (22.2.3)

$$dP = \frac{1}{(2\pi)^{n/2}\sigma^n}\,e^{-\frac{n\bar{x}^2 + ns^2}{2\sigma^2}}\,dy_1 \cdots dy_{n-1}\,\sqrt{n}\,d\bar{x}.$$

We further introduce the substitution

(29.3.4) $y_i = \sqrt{n}\,s\,z_i \qquad (i = 1, \ldots, n-1),$

which signifies that we take the length $XR = \sqrt{n}\,s$ as unit. However, by the last substitution we have replaced the n − 1 variables $y_i$ by n new variables s and $z_1, \ldots, z_{n-1}$. Accordingly there is a relation between the new variables, which is found by squaring and adding the n − 1 equations (29.3.4). We then obtain

(29.3.5) $\sum_1^{n-1} z_i^2 = 1,$

and thus one of the $z_i$, say $z_{n-1}$, is expressed as a function of the n − 2 others, so that in (29.3.4) the old variables $y_1, \ldots, y_{n-1}$ are replaced by the new variables s and $z_1, \ldots, z_{n-2}$. For the Jacobian J of the transformation we have, since $\frac{\partial z_{n-1}}{\partial z_i} = -\frac{z_i}{z_{n-1}}$,
We then obtain n-l (29.3.5) = 1 • and thus one of the Zi^ say Zn-u expressed as a function of the n — 2 others, so that in (29.3.4) the old variables Vi* • • •» yn-i are replaced by the new variables s and ^n- 2 - For the Jacobian J of the transformation we have, since ^ dZi Zn-l 384 29.3 n S', 1 ^ 0 0 1 0 . . . 0 u z^ 0 0 « — 1 ^8 0 1 . . 0 • • n fl Zn—’l 0 0 . ... V ns 2^11— l -S'/I- 0 0 . . . 1 n Zh-\ — V 1} fi -s-i . . . — 1 MS Zn-2 1 .r® . . Zn — l 2n-\ i»-i n-1 = ( -1) ..-i” j''""'' = + « ^ s” -2 1 1 - ?r~“ — — 2 • any system of values (//,, . • •, (0, . . ..0) we obtain from (29.3.4) and (29.3.5) a uniquely determined system of values of .^n ~‘2 and A’, such that ^ > 0. On the other hand, to any given »i--2 system of values of . . ., ^n -2 and s, such that < 1 and a>0, 1 there correspond tivo values of ,fn-i with opposite signs determined by (29.3.5), viz. .e'u-i = ± — .f'{ — — 5'n-2, and thus two systems of values of the say //|, . . ^n- 2 , ± y«-i- Both these systems yield the same value of the probability element dP and the modulus \J\ of the Jacobian, and thus we obtain by means of a remark in 22.2 the expression The probability element dP appears here as a product of three fac- tors, viz. the probability elements of x and s, and the joint probability element of . . ., Zn- 2 - We thus see (cf 22.1.2) that x and s are in- dependent not only of one another, but also of the combined variable (zi, . . ., Zn~^)y and that the distributions of x and s are those given by the above theorem.^) The same result can be obtained by means of the transformation x,- ~ x + « which has been used for this and other purposes e. g. by Behrens, SteflFenaen, Itascdi and Hald (Ref. 60, 218, 206 . 25 — 464 H, Cramer 385 29.3 For a later purpose we finally observe that, in the general case when the population mean m is not zero, the above transformation of the probability element may be written dP- 1 -j-jiSIx/—)* dXi . . . dXn (29.3.6) _ ri__ aV 2 n^ 2«V-r ds - ds & de, . . . 
dzn-i fl-1 |/ 7t Vl—z\ Consider the effect of the above transformation on the expression «r'’ «rV * / ^ " niv ml - j By means of the identity (29.3. 1)^ it is easily shown that every — x is transformed into a linear combination of y,, . . It then follows from (29.3.4) and (29.3.6) that is a function of z^t . » only. Thus the three variables ir, g and arc independent. (Cf Geary, Ref. 116). Following Geary, we can use this observation to obtain exact expressions (first given by Fisher, Ref. 101) for the mean and the variance of the coefffcients 171 == and — 3, instead of the asymptotic expressions r27.7.9). It follows, in fact, from the independence theorem that / / 1P\ / \ • E \ni^^ J = E .so that the mean value of (m,. m“*’/3)^ can bo calculated from iE(nij!) and E In this way we obtain £Cv.) = o. (29.3.7) 6 (n - 2) (n + 1) (n + 8) D\g,)^ 24 n (?i — 2)(n — 3) (n + 1)*(»"+ 3)(n + 6) Thus y, is affected with a negative biai( of order while y, is unbiased. If, instead of yi and y^, we consider the analogous quantities f _ E* - ^ ~ „ ' jr'/« n -2 e\ (n -2) (» - S) 386 .29.8.8 , 29.3-4 where the Kv are the unbiased semi-invariant estimates of Fisher ,cf 27. 6\ the bias disappears, and we obtain E{O^) = E{(i2) = 0, (29.8.9) 6n (n - 1) n — 2)(w -f l)(n + 3)' D'iG^)== 24 n (n - D* {n - 3)(w - 2)(n + 3)(n -h 6)’ 29.4. Student's ratio. — Consider the variables V‘ n — m) and j when the parent distribution is normal (w, a). According to the preceding paragraph, these two variables are independent, and ^ n(x — m) is normal (0, a), while — is distributed as the arith- n — 1 metic mean of n — 1 squares of independent normal (0, or) variables. Bj the definition of Student’s distribution in 18.2, the ratio (29.4.1) is then distributed in Student's distribution with n — 1 degrees of free- dom, Thus t has the fr. f. This can, of course, also be shown more directly. Assuming for simplicity m = 0, we replace the sample variables iCj, . . 
$x_n$ by new variables $y_1, \ldots, y_n$ by means of an orthogonal transformation such that

$$y_1 = \sqrt{n}\,\bar{x} = \frac{x_1 + \cdots + x_n}{\sqrt{n}}.$$

Then $ns^2 = \sum_1^n x_i^2 - n\bar{x}^2 = \sum_2^n y_i^2$, and thus

$$t = \sqrt{n-1}\;\frac{\bar{x}}{s} = \frac{y_1}{\sqrt{\frac{1}{n-1}\sum_2^n y_i^2}},$$

where by 29.2 the $y_i$ are independent and normal $(0, \sigma)$. We can then directly apply the argument of (18.2.1)–(18.2.4).

If, in the first expression of $t$ in (29.4.1), we replace $s^2$ by its mean value $\frac{n-1}{n}\,\sigma^2$, we obtain the variable $\sqrt{n}\,\frac{\bar{x} - m}{\sigma}$, which is obviously normal $(0, 1)$. It follows from 20.6 that the difference $t - \sqrt{n}\,\frac{\bar{x} - m}{\sigma}$ converges in probability to zero as $n \to \infty$. Accordingly by (20.2.2) the fr. f. of $t$ tends to $\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$ as $n \to \infty$.

The variable $t$ defined by (29.4.1) is known as Student's ratio.¹) Its distribution was first discovered by Student (Ref. 221), whose results were then rigorously proved by R. A. Fisher (Ref. 97). As already pointed out in 18.2, the fr. f. $s_{n-1}$, as well as the variable $t$ itself, does not contain $\sigma$. As soon as we know $m$, we may thus calculate $t$ from the sample values, and compare the observed value of $t$ with the theoretical distribution. In this way we obtain a practically important test of significance for the deviation of the sample mean $\bar{x}$ from some hypothetical value of the population mean $m$ (cf 31.2 and 31.3, Ex. 4).

Of even greater practical importance is the application of Student's distribution to test the significance of the difference between two mean values (R. A. Fisher, Ref. 97; cf 31.2). The sampling distribution relevant to this problem is obtained as follows. Suppose that we have two independent samples $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$, drawn from the same normal population. Without loss of generality, we may assume $m = 0$. Let the mean and the variance of the first sample be denoted by

$$\bar{x} = \frac{1}{n_1}\sum_1^{n_1} x_i \qquad \text{and} \qquad s_1^2 = \frac{1}{n_1}\sum_1^{n_1} (x_i - \bar{x})^2,$$

while $\bar{y}$ and $s_2^2$ are the corresponding characteristics of the second sample. We now replace all the $n_1 + n_2$ variables $x_1, \ldots, x_{n_1}, y_1, \ldots, y_{n_2}$ by new variables $z_1, \ldots,$
$z_{n_1+n_2}$, by means of an orthogonal transformation such that $z_1 = \sqrt{n_1}\,\bar{x}$ and $z_2 = \sqrt{n_2}\,\bar{y}$. The quadratic form

$$n_1 s_1^2 + n_2 s_2^2 = \sum_1^{n_1} x_i^2 + \sum_1^{n_2} y_i^2 - n_1\bar{x}^2 - n_2\bar{y}^2$$

is then transformed into $Q = \sum_3^{n_1+n_2} z_i^2$, which shows that the rank, or the number of d. of fr., of $Q$ is $n_1 + n_2 - 2$. If we define a random variable $u$ by the relation

$$u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}\;\frac{\bar{x} - \bar{y}}{\sqrt{n_1 s_1^2 + n_2 s_2^2}},$$

$u$ is then transformed into

$$(29.4.2)\qquad u = \frac{w}{\sqrt{\dfrac{Q}{n_1 + n_2 - 2}}}, \qquad w = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\,(\bar{x} - \bar{y}) = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\left(\frac{z_1}{\sqrt{n_1}} - \frac{z_2}{\sqrt{n_2}}\right),$$

where $w$ and $z_3, \ldots, z_{n_1+n_2}$ are independent and normal $(0, \sigma)$. We can now once more apply the argument of 18.2, and it follows that the variable $u$ is distributed in Student's distribution with $n_1 + n_2 - 2$ d. of fr., so that $u$ has the fr. f. $s_{n_1+n_2-2}(x)$. This result evidently holds true irrespective of the value of $m$. — It will be observed that in this case neither the variable $u$ nor the corresponding fr. f. contains any of the parameters $m$ and $\sigma$ of the parent distribution. Thus we can calculate $u$ directly from the sample values, and compare the observed value of $u$ with the theoretical distribution (cf 31.2 and 31.3, Ex. 4).

¹) Student actually considered the ratio $z = t/\sqrt{n-1} = (\bar{x} - m)/s$.

Consider the quadratic form $ns^2 = \sum_1^n (x_i - \bar{x})^2$ in the $n$ sample variables $x_1, \ldots, x_n$, assuming that the population mean $m$ is zero. Replacing the $x_i$ by new variables $y_i$ by means of an orthogonal transformation such that the two first variables are

$$y_1 = \sqrt{n}\,\bar{x}, \qquad y_2 = \sqrt{\frac{n}{n-1}}\,(x_1 - \bar{x}),$$

the form $ns^2$ is transformed into $\sum_2^n y_i^2$. Consequently the variable

$$(29.4.3)\qquad r = \frac{x_1 - \bar{x}}{s},$$

which expresses the deviation of the sample value $x_1$ from the sample mean, measured in units of the s. d. $s$ of the sample, becomes

$$r = \sqrt{\frac{n-1}{n}}\;\frac{y_2}{\sqrt{\frac{1}{n}\sum_2^n y_i^2}}.$$

Now $y_2, \ldots, y_n$ are independent and normal $(0, \sigma)$, and thus by (18.2.6) and (18.2.7) the variable $r$ has the fr. f. (cf Thompson, Ref. 226, and Arley, Ref. 58)

$$(29.4.4)\qquad \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\sqrt{\pi(n-1)}\,\Gamma\!\left(\frac{n-2}{2}\right)}\left(1 - \frac{x^2}{n-1}\right)^{\frac{n-4}{2}} \qquad \left(|x| < \sqrt{n-1}\right).$$

The variable

$$t = \frac{r\,\sqrt{n-2}}{\sqrt{n-1-r^2}}$$

is then, by 18.2, distributed in Student's distribution with $n-2$ d. of fr.
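The two-sample ratio $u$ defined above is, apart from the book's divisor-$n$ variances, the familiar pooled two-sample $t$ statistic. The following sketch (my own illustration, with simulated data; sample sizes and seed are arbitrary) checks that the two formulas coincide numerically:

```python
import numpy as np

# The ratio u of 29.4, built from divisor-n variances s_i^2, equals the
# usual pooled t statistic built from divisor-(n-1) variances.
def u_ratio(x, y):
    n1, n2 = len(x), len(y)
    s1sq = x.var()            # divisor n1, Cramer's s1^2
    s2sq = y.var()            # divisor n2
    num = np.sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2)) * (x.mean() - y.mean())
    return num / np.sqrt(n1 * s1sq + n2 * s2sq)

def pooled_t(x, y):
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 8)
y = rng.normal(0.0, 1.0, 12)
print(np.isclose(u_ratio(x, y), pooled_t(x, y)))
```

The identity holds because $n_1 s_1^2 + n_2 s_2^2 = (n_1 - 1)\,\hat{s}_1^2 + (n_2 - 1)\,\hat{s}_2^2$ when the $\hat{s}_i^2$ are the divisor-$(n_i - 1)$ variances, so the two normalizations differ only in how the same pooled sum of squares is written.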
— It follows from the definition of $r$ that these results hold irrespective of the value of $m$. Any relative deviation $\frac{x_i - \bar{x}}{s}$ has, of course, the same distribution as $r$. These results are of importance in connection with the question of criteria for the rejection of outlying observations.

More generally, if we consider the arithmetic mean $\bar{x}_k = \frac{x_1 + \cdots + x_k}{k}$, where $1 \leq k < n$, and write $r_k = \frac{\bar{x}_k - \bar{x}}{s}$, the variable $r_k\sqrt{\frac{kn}{n-k}}$ has the fr. f. (29.4.4), and consequently the variable

$$(29.4.6)\qquad t = \frac{r_k\,\sqrt{kn(n-2)}}{\sqrt{(n-k)(n-1) - kn\,r_k^2}}$$

has Student's distribution with $n-2$ d. of fr. (Thompson, Ref. 226). This may be used for testing the significance of the difference between the mean of a sub-group and a general mean (cf 31.3, Ex. 5).

29.5. A lemma. — We now proceed to the study of sampling distributions connected with a multidimensional normal parent distribution. In this preliminary paragraph, we shall prove certain results due to Wishart and Bartlett (Ref. 240, 241) that will be required in the sequel.

Let $\boldsymbol{A} = \{a_{ij}\}$, where $a_{ji} = a_{ij}$, be a definite positive matrix of order $k$ (cf 11.10) with constant elements, while $\boldsymbol{X} = \{x_{ij}\}$,
of a distribution in R^k{k+i)- — The com- plete expression of Ckti is, in fact, (29.5.2) Ckn = M*-i) 7t ^ T n—l For (29.5.1) — (29.6.2) reduce to /«(») — ' . ^("2 (a? > 0, a > 0), which is evidently a fr. f. in Kj. For ^>1, we have to show that Cun may be determined such that the integral of fn over the whole space R^k{k-^-i) is equal to 1. We shall first consider the particular case when .4 is a diagonal matrix (cf 11.1), so that aij = 0 for i Since A is definite posi- tive, we then have «<, > 0 for e = 1, . . ., A:. — In any point of the set we have Xa^O for z = 1, . . ., k. Introducing, for every Xij with i ^ j, the substitution (29.5.3) Xij = f/ij V Xi iXjj, we have yji=^yij, and X^DYD, where D denotes the diagonal matrix with the elements V V V Xkk, while 391 29.5 ’ 1 j/ii • • Vlk r= 921 1 • y^k ?/i i yk i . ■ • 1 Denoting^ by }' the determinant of Y we thus have X~.riiX 22 • • ■ ookk When X is definite positive, so is F, and conversely. The Jacobian of k-l the transformation (29.5.3) being* ^ , we thus have r n-k-2 I X ^ r 1 “2 / •*/ / OO 00 / / / ” ^ X ^ / j / = J I . . . -nk) e 1 r/.r,, ^/.r 22 • • . dxik' (f - . . . djfk~.\.k. the integral with respect to the y,j being extended over the set S' of all yij such that Y is definite positive. Obviously the integral with respect to the yij, say Jjt, depends only on I' and ?/, so that the whole integral reduces to HV)P‘ _ //*« M-l Tl~ 1 ' (r/ji (f22 • . • (fik) ‘ A where likn depends only on k and n. Taking in (29.5.1) a,, = it follows that the integral of /,i(xii, . . ., Xkk) over the whole space R^k{k+\) is equal to 1, so that fn (being obviously non-negative) is the fr. f. of a distribution in R^k{k+ih Jn order to complete the proof in Ahe case when a, j =0 for i -/ J, it remains to verify the expression (29.6.2) for Tt folloAvs from the above that Ave have to prove ii~k-2 ‘^k ~ f ^ ^ • • • ^^yk-l,k "" N' 1) T 4 for 2 ^ k < n. 
This may be proved by induction, and we shall indicate the general lines of the proof. For $k = 2$, our relation reduces to

$$J_2 = \int_{-1}^{1} (1 - y^2)^{\frac{n-4}{2}}\, dy = \frac{\sqrt{\pi}\,\Gamma\!\left(\frac{n-2}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)},$$

which may be directly verified, since the substitution $y^2 = z$ changes the integral into a Beta-function (cf 12.4). Suppose now that our relation has been proved for a certain value of $k$, and consider $J_{k+1}$. Expanding the determinant under the integral according to (11.5.3), the integration with respect to the variables $y_{i,k+1}$ $(i = 1, \ldots, k)$ may be carried out by the same methods as the integrals (11.12.3)–(11.12.4), and we obtain

$$J_{k+1} = J_k \cdot \frac{\pi^{\frac{k}{2}}\,\Gamma\!\left(\frac{n-k-1}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}.$$

Thus the relation holds for $k+1$, and the proof is completed.

In the general case when $\boldsymbol{A}$ is any definite positive matrix, we consider the transformation

$$(29.5.4)\qquad \boldsymbol{C}'\boldsymbol{A}\boldsymbol{C} = \boldsymbol{B}, \qquad \boldsymbol{C}'\boldsymbol{X}\boldsymbol{C} = \boldsymbol{Y},$$

where $\boldsymbol{C}$ is an orthogonal matrix such that $\boldsymbol{B}$ is a diagonal matrix (cf 11.9). The set $S$ in the $x$-space is transformed into the analogous set $S_1$ in the $y$-space. From the proof given above, it then follows that the function

$$(29.5.5)\qquad g_n(y_{11}, \ldots, y_{kk}) = \begin{cases} C_{kn}\, B^{\frac{n-1}{2}}\, Y^{\frac{n-k-2}{2}}\, e^{-\sum b_{ij}y_{ij}} & \text{in } S_1, \\ 0 & \text{in } S_1^*, \end{cases}$$

is a fr. f. in the $y$-space. (Note that we have $b_{ij} = 0$ for $i \neq j$.) Now, since the determinant of $\boldsymbol{C}$ is equal to $\pm 1$, we have $A = B$ and $X = Y$, and it is further verified by direct substitution that we have $\sum a_{ij}x_{ij} = \sum b_{ij}y_{ij}$. Thus if, in the distribution (29.5.5), we introduce the transformation of random variables defined by (29.5.4), we obtain according to 22.2 a transformed distribution with the fr. f. $f_n(x_{11}, \ldots, x_{kk})$. Thus $f_n$ is a fr. f., and our assertion is proved.

In the particular case $k = 2$, there are three distinct variables $x_{11}$, $x_{12}$ and $x_{22}$. The set $S$ is the domain defined by the inequalities $x_{11} > 0$, $x_{22} > 0$, $x_{12}^2 < x_{11}x_{22}$. In $S$ we have

$$(29.5.6)\qquad f_n(x_{11}, x_{12}, x_{22}) = \frac{2^{n-3}}{\pi\,(n-3)!}\,\left(a_{11}a_{22} - a_{12}^2\right)^{\frac{n-1}{2}}\left(x_{11}x_{22} - x_{12}^2\right)^{\frac{n-4}{2}}\, e^{-(a_{11}x_{11} + 2a_{12}x_{12} + a_{22}x_{22})},$$

where (cf 12.4.4) $C_{2n} = \frac{2^{n-3}}{\pi\,(n-3)!}$. Outside $S$ the fr. f. is zero.
We shall also consider the c. f. $\varphi_n(t_{11}, \ldots, t_{kk})$ corresponding to the fr. f. $f_n(x_{11}, \ldots, x_{kk})$ defined by (29.5.1). Let $\boldsymbol{T} = \{t_{ij}\}$ denote the symmetric matrix of the variables $t_{ij}$, and put

$$\varepsilon_{ij} = \begin{cases} 1 & \text{for } i = j, \\ \tfrac{1}{2} & \text{for } i \neq j. \end{cases}$$

Since $f_n = 0$ in $S^*$, the c. f. corresponding to the fr. f. $f_n$ is

$$\varphi_n(t_{11}, \ldots, t_{kk}) = \int\cdots\int_S e^{\boldsymbol{i}\sum \varepsilon_{ij}t_{ij}x_{ij}}\, f_n(x_{11}, \ldots, x_{kk})\, dx_{11} \ldots dx_{kk}.$$

(In order to avoid confusion, we use here a heavy-faced $\boldsymbol{i}$ to denote the imaginary unit, as already mentioned in 27.1.) For $t_{ij} = 0$, the integral is equal to 1, so that we have

$$\int\cdots\int_S X^{\frac{n-k-2}{2}}\, e^{-\sum a_{ij}x_{ij}}\, dx_{11}\, dx_{12} \ldots dx_{kk} = \frac{1}{C_{kn}\,A^{\frac{n-1}{2}}}.$$

Replacing here $a_{ij}$ by $a_{ij} - \boldsymbol{i}\,\varepsilon_{ij}t_{ij}$, and denoting by $A^*$ the determinant $A^* = |a_{ij} - \boldsymbol{i}\,\varepsilon_{ij}t_{ij}|$, we obtain finally the expression

$$(29.5.7)\qquad \varphi_n(t_{11}, \ldots, t_{kk}) = \left(\frac{A}{A^*}\right)^{\frac{n-1}{2}}$$

for the c. f. corresponding to the distribution (29.5.1).¹)

¹) Ingham (Ref. 180) has shown directly that the c. f. (29.5.7) gives, according to the inversion formula of Chapter 10, the fr. f. (29.5.1).

29.6. Sampling from a two-dimensional normal distribution. — In a basic paper of 1915, R. A. Fisher (Ref. 88) gave exact expressions for certain sampling distributions connected with a two-dimensional normal parent distribution. We shall now prove some of Fisher's results, using the method of characteristic functions first applied to these problems by Romanovsky (Ref. 208, 209). It will be found that the distributions obtained are particular cases of the distributions considered in the preceding paragraph.

Consider a non-singular normal distribution in two variables (cf 21.12). Without loss of generality, we may assume the first order moments equal to zero, so that the fr. f. is, in the usual notation,

$$\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\varrho^2}}\, e^{-\frac{1}{2(1-\varrho^2)}\left(\frac{x^2}{\sigma_1^2} - \frac{2\varrho xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right)} = \frac{1}{2\pi\sqrt{M}}\, e^{-\frac{\mu_{02}x^2 - 2\mu_{11}xy + \mu_{20}y^2}{2M}},$$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2 = \sigma_1^2\sigma_2^2(1-\varrho^2)$ is the determinant of the moment matrix $\boldsymbol{M} = \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix}$. From a sample of $n$ observed pairs of values $(x_1, y_1), \ldots, (x_n,$
$y_n)$, we calculate the moment characteristics of the first and second orders (cf 27.1.6):

$$\bar{x} = \frac{1}{n}\sum x_i, \qquad \bar{y} = \frac{1}{n}\sum y_i,$$

$$(29.6.1)\qquad m_{20} = s_1^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2,$$

$$m_{11} = r\,s_1 s_2 = \frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum x_i y_i - \bar{x}\bar{y},$$

$$m_{02} = s_2^2 = \frac{1}{n}\sum (y_i - \bar{y})^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2.$$

We now propose to find the joint distribution of the five random variables $\bar{x}$, $\bar{y}$, $m_{20}$, $m_{11}$ and $m_{02}$. The c. f. of this distribution is a function of five variables $t_1$, $t_2$, $t_{20}$, $t_{11}$ and $t_{02}$, viz.

$$(29.6.2)\qquad \varphi = \frac{1}{\left(2\pi\sqrt{M}\right)^n}\int\cdots\int e^{\Omega}\, dx_1 \ldots dx_n\, dy_1 \ldots dy_n,$$

where

$$\Omega = \boldsymbol{i}\left(t_1\bar{x} + t_2\bar{y} + t_{20}m_{20} + t_{11}m_{11} + t_{02}m_{02}\right) - \frac{1}{2M}\sum_{\nu=1}^n \left(\mu_{02}x_\nu^2 - 2\mu_{11}x_\nu y_\nu + \mu_{20}y_\nu^2\right),$$

and the integral is extended over the $2n$-dimensional space of the variables $x_1, \ldots, x_n, y_1, \ldots, y_n$. We now replace $x_1, \ldots, x_n$ by new variables $\xi_1, \ldots, \xi_n$ by means of an orthogonal transformation such that $\xi_1 = \sqrt{n}\,\bar{x}$, and apply a transformation with the same matrix to $y_1, \ldots, y_n$, which are thus replaced by new variables $\eta_1, \ldots, \eta_n$ such that $\eta_1 = \sqrt{n}\,\bar{y}$. We then have

$$\sum_1^n \xi_\nu^2 = \sum_1^n x_\nu^2, \qquad \sum_1^n \eta_\nu^2 = \sum_1^n y_\nu^2, \qquad \sum_1^n \xi_\nu\eta_\nu = \sum_1^n x_\nu y_\nu,$$

$$n\,m_{20} = \sum_2^n \xi_\nu^2, \qquad n\,m_{11} = \sum_2^n \xi_\nu\eta_\nu, \qquad n\,m_{02} = \sum_2^n \eta_\nu^2,$$

and hence

$$\Omega = \left[\boldsymbol{i}\,\frac{t_1\xi_1 + t_2\eta_1}{\sqrt{n}} - \frac{\mu_{02}\xi_1^2 - 2\mu_{11}\xi_1\eta_1 + \mu_{20}\eta_1^2}{2M}\right] + \sum_{\nu=2}^n \left[-\left(\frac{\mu_{02}}{2M} - \frac{\boldsymbol{i}\,t_{20}}{n}\right)\xi_\nu^2 + \left(\frac{\mu_{11}}{M} + \frac{\boldsymbol{i}\,t_{11}}{n}\right)\xi_\nu\eta_\nu - \left(\frac{\mu_{20}}{2M} - \frac{\boldsymbol{i}\,t_{02}}{n}\right)\eta_\nu^2\right].$$

Introducing this expression of $\Omega$ in (29.6.2), the transformed $2n$-fold integral reduces to a product of $n$ double integrals, which may be directly evaluated by means of (11.12.1) and (11.12.2). The joint c. f. (29.6.2) then takes the form

$$(29.6.3)\qquad \varphi = e^{-\frac{\mu_{20}t_1^2 + 2\mu_{11}t_1t_2 + \mu_{02}t_2^2}{2n}}\left(\frac{A}{A^*}\right)^{\frac{n-1}{2}},$$

where $A$ and $A^*$ are the determinants of the matrices

$$\boldsymbol{A} = \frac{n}{2M}\begin{pmatrix} \mu_{02} & -\mu_{11} \\ -\mu_{11} & \mu_{20} \end{pmatrix} \qquad \text{and} \qquad \boldsymbol{A}^* = \begin{pmatrix} \dfrac{n\mu_{02}}{2M} - \boldsymbol{i}\,t_{20} & -\dfrac{n\mu_{11}}{2M} - \dfrac{\boldsymbol{i}\,t_{11}}{2} \\[2mm] -\dfrac{n\mu_{11}}{2M} - \dfrac{\boldsymbol{i}\,t_{11}}{2} & \dfrac{n\mu_{20}}{2M} - \boldsymbol{i}\,t_{02} \end{pmatrix}.$$

The joint c. f. (29.6.3) is a product of two factors, the first of which contains only the variables $t_1$ and $t_2$, while the second factor contains only $t_{20}$, $t_{11}$ and $t_{02}$. The first factor is, by (21.12.2), the c. f. of a normal distribution with zero mean values¹) and the moment matrix $n^{-1}\boldsymbol{M}$. The second factor, on the other hand, is a particular case of the c. f. (29.5.7). In fact, if we take in the preceding paragraph $k = 2$ and

$$\boldsymbol{A} = \frac{n}{2M}\begin{pmatrix} \mu_{02} & -\mu_{11} \\ -\mu_{11} & \mu_{20} \end{pmatrix} = \frac{n}{2}\,\boldsymbol{M}^{-1}, \qquad \boldsymbol{T} = \begin{pmatrix} t_{20} & t_{11} \\ t_{11} & t_{02} \end{pmatrix},$$

the c. f. (29.5.7) reduces to the second factor of (29.6.3).
The corresponding distribution is then the particular case $k = 2$ of (29.5.1), which has already been given in (29.5.6), with the variables $x_{11}$, $x_{12}$ and $x_{22}$ replaced by $m_{20}$, $m_{11}$ and $m_{02}$ respectively. Thus by 22.4 we have the following theorem:

The combined random variables $(\bar{x}, \bar{y})$ and $(m_{20}, m_{11}, m_{02})$ are independent. The joint distribution of $\bar{x}$ and $\bar{y}$ is normal, with the same first order moments as the parent distribution, and the moment matrix $n^{-1}\boldsymbol{M}$. The joint distribution of $m_{20}$, $m_{11}$ and $m_{02}$ has the fr. f. $f_n$ given by

$$(29.6.4)\qquad f_n(m_{20}, m_{11}, m_{02}) = \frac{n^{n-1}}{4\pi\,(n-3)!\,M^{\frac{n-1}{2}}}\left(m_{20}m_{02} - m_{11}^2\right)^{\frac{n-4}{2}}\, e^{-\frac{n}{2M}\left(\mu_{02}m_{20} - 2\mu_{11}m_{11} + \mu_{20}m_{02}\right)}$$

in the domain $m_{20} > 0$, $m_{02} > 0$, $m_{11}^2 < m_{20}m_{02}$, while $f_n = 0$ outside this domain.

¹) If, more generally, we consider a parent distribution with arbitrary mean values, we obviously obtain here the same means as for the parent distribution.

The mean values and the moment matrix of the five sample moments may be calculated from the c. f. (29.6.3). We find, e.g.,

$$E(m_{20}) = \frac{n-1}{n}\,\mu_{20}, \qquad E(m_{11}) = \frac{n-1}{n}\,\mu_{11}, \qquad E(m_{02}) = \frac{n-1}{n}\,\mu_{02},$$

in accordance with 27.4 and 27.8.

29.7. The correlation coefficient. — In the joint distribution (29.6.4) of the variables $m_{20}$, $m_{11}$ and $m_{02}$, we now introduce the new variable
of r: tJ — 9 n—i '/I — 4 /’ /»n — 2 t] nr (29.7.2) (I-, ■<)--■ j The distribution of r was discovered by R. A. Fisher (Ref. 88). We observe the remarkable property that the distribution of r only depends on the size n of the sample and on the correlation coefficient Q of the population. For n = 2, the fr. f . fn (r) reduces to zero, in accordance with the fact that a correlation coefficient calculated from a sample of only two observations is necessarily equal to ±1, so that in this case the distribution belongs to the discrete type. For m = 3 the frequency 398 2^.7 Fig. 29 a. Frequency curves for the correlation coefficient r in samples from a normal population, n = 10. curve is [/-shaped, with infinite ordinates in the points ?* = Hh 1. for « == 4 we have a rectangular distribution if p = 0, and otherwise a ^/-shaped distribution. For > 4, the distribution is unimodal, with the mode situated in the point r = 0 if p = 0, and otherwise near the point r = q. Some examples are shown in Pigs 29 a — b. The distribution of r has been studied in detail by several authora (cf e. g. Soper and others, Eef. 216, and Romanovsky, Ref. 208), and extensive tables have been published by David (Ref. 261). Various exact and approximate formulae for the characteristics of the distribu- tion are known. Any moment of r can, of course, be directly calcul- ated from (29.7.1), but we shall here content ourselves with the asymptotic formulae for E(r) and I>*(r) for large n that have already been given in (27.8.1) and (27.8.2). For practical purposes, it is often preferable to use the trans- formation (29.7.3) ^ = i log J ^ ^ i log p ’ introduced by R. A. Fisher (Ref. 13, 90). Fisher has shown that the variable .z is, already for moderate values of w, approximately nor- 399 29.7 Fi/jf, 29 b. Frequency curves for the correlation coefficient r in samples from a normal population, n = 50. 
nially distributed with mean and variance given by the approximate expressions {29.7.4) Thus the form of the ^-distribution is, in the first approximation, in- dependent of the parameter p, while the distribution of r changes its form considerably when p varies. It is instructive to compare in this respect the illustrations of the r- and ^'-distributions given in Figs 29 and 30. Cf further 31.3, Ex. 6. In the particular case p = 0, the fr. f. (29.7.1) reduces by (12.4.4) to (29.7.5) a form conjectured by Student (Ref. 222) in 1908. We have already encountered this fr. f . in other connections in (18.2.7) and (29.4.4). By 18.2, the transformed variable f == F w 2 is in this case F 1 — 400 I Fig. 80 b. Frequency curves for r = log samples from a normal popula- tion. n = 50. distributed in Student’s distribution with n — 2 d. of fr. If tp denotes the p % value of t for w — 2 d. of fr. (cf 18.2), we have the prob- ability % of obtaining a value of t such ^ that \t\>tp, and this inequality is equivalent with (cf 31.3, Ex. 7) (29.7.6) \r\> 401 26 — 464 H. CramSr 29.8 29.8. The regression coefficients. — The regression coefficients of the parent distribution Ai — fHo ’ Pl2 — ~~ — ' " ’ f ^02 <^2 have been defined in 21.6. In accordance with the general rules of 27.1, the corresponding regression coefficients of the sample will be denoted by (29.S.1) ^ 3 ^ ^ ^ ?;.v, Wjo Si »n„j A’a It will be sufficient to consider the sampling distribution of one of these, say The distribution of Z #,2 c^-n then be obtained by per- mutation of indices. In the joint distribution (29. (5.4) of m 2 o, and w?og, we replace mil by the new variable by means of the substitution mu =nuQb^i. We can then directly perform the integration, first with respect to mo 2 over all values such that mo 2 > w^ 2 o ?; 2 i, and then with respect to mgo over all positive values. In this way we obtain the following ex- pression for the fr. 
f. of the sample regression coefficient $b_{21}$:

$$(29.8.2)\qquad f(b_{21}) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\,\Gamma\!\left(\frac{n-1}{2}\right)}\cdot\frac{M^{\frac{n-1}{2}}\,\mu_{20}^{\frac{2-n}{2}}}{\left(\mu_{20}b_{21}^2 - 2\mu_{11}b_{21} + \mu_{02}\right)^{\frac{n}{2}}}.$$

This distribution was first found by K. Pearson and Romanovsky (Ref. 185, 210). If we introduce here the new variable

$$(29.8.3)\qquad t = \sqrt{n-1}\;\frac{\sigma_1}{\sigma_2\sqrt{1-\varrho^2}}\,(b_{21} - \beta_{21}),$$

it is found that $t$ is distributed in Student's distribution with $n-1$ d. of fr.

If we compare the distribution of $b_{21}$ with the distribution of $r$, it is evident that the former has not the attractive property belonging to the latter, of containing only the population parameter directly corresponding to the variable. The fr. f. (29.8.2) contains, in fact, all three moments $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$. If we want to calculate the quantity $t$ from (29.8.3) in order to test some hypothetical value of $\beta_{21}$, we shall have to introduce hypothetical values of all these three moments. In order to remove this inconvenience, we consider the variable

$$(29.8.4)\qquad t' = \sqrt{n-2}\;\frac{s_1}{s_2\sqrt{1-r^2}}\,(b_{21} - \beta_{21}),$$

where the population characteristics $\sigma_1$, $\sigma_2$ and $\varrho$ occurring in (29.8.3) have been replaced by the corresponding sample characteristics $s_1$, $s_2$ and $r$, while the factor $\sqrt{n-1}$ has been replaced by $\sqrt{n-2}$. If this variable $t'$ is introduced instead of $b_{21}$ in the joint distribution (29.6.4), the integration with respect to $m_{20}$ and $m_{02}$ can be directly performed, and we obtain the interesting result that $t'$ is distributed in Student's distribution with $n-2$ d. of fr. (Bartlett, Ref. 54.) The replacing of the population characteristics by sample characteristics has thus resulted in a loss of one d. of fr. — When it is required to test a hypothetical value of $\beta_{21}$, we can now calculate $t'$ directly from an actual sample, and thus obtain a test of significance for the deviation of the observed value of $b_{21}$ from the hypothetical $\beta_{21}$. (Cf 31.3, Ex. 6.)

29.9. Sampling from a k-dimensional normal distribution. — The results of 29.6 may be generalized to the case of a $k$-dimensional normal parent distribution. Consider a non-singular normal distribution in $k$ dimensions (cf 24.2).
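Before leaving the two-dimensional case, the claim of 29.8 — that $t'$ of (29.8.4), built entirely from sample quantities, follows Student's distribution with $n-2$ d. of fr. — can be illustrated by a small simulation. This sketch is my own (population parameters, sample size and seed are arbitrary choices); Student's distribution with $m$ d. of fr. has mean 0 and variance $m/(m-2)$, which the simulated $t'$ should reproduce:

```python
import numpy as np

# Monte Carlo check: t' = sqrt(n-2) * s1/(s2*sqrt(1-r^2)) * (b21 - beta21)
# should have mean 0 and variance (n-2)/(n-4), matching Student's
# distribution with n - 2 d. of fr.
rng = np.random.default_rng(2)
n, rho, sig1, sig2, reps = 12, 0.6, 1.0, 2.0, 100_000
beta21 = rho * sig2 / sig1                 # population regression coefficient

cov = [[sig1**2, rho * sig1 * sig2], [rho * sig1 * sig2, sig2**2]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
x, y = xy[..., 0], xy[..., 1]

m20 = x.var(axis=1)                        # divisor n, as in the text
m02 = y.var(axis=1)
m11 = ((x - x.mean(axis=1, keepdims=True)) *
       (y - y.mean(axis=1, keepdims=True))).mean(axis=1)
b21 = m11 / m20
r = m11 / np.sqrt(m20 * m02)
tprime = np.sqrt(n - 2) * np.sqrt(m20 / (m02 * (1 - r**2))) * (b21 - beta21)
print(round(tprime.mean(), 2), round(tprime.var(), 2))
```

With $n = 12$ the theoretical variance is $(n-2)/(n-4) = 1.25$; the simulated moments should agree within Monte Carlo error.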
Without loss of generality, we may assume the first order moments equal to zero, so that the fr. f. is (cf 24.2.1)

$$(29.9.1)\qquad \frac{1}{(2\pi)^{\frac{k}{2}}\sqrt{\Lambda}}\, e^{-\frac{1}{2\Lambda}\sum \Lambda_{ij}x_ix_j} = \frac{1}{(2\pi)^{\frac{k}{2}}\sigma_1\cdots\sigma_k\sqrt{P}}\, e^{-\frac{1}{2P}\sum P_{ij}\frac{x_i}{\sigma_i}\frac{x_j}{\sigma_j}},$$

where $\boldsymbol{\Lambda} = \{\lambda_{ij}\}$ is the moment matrix, and $\boldsymbol{P} = \{\varrho_{ij}\}$ the correlation matrix of the distribution (cf 22.3), $\Lambda$ and $P$ are the corresponding determinants, and $\Lambda_{ij}$ and $P_{ij}$ their cofactors. Throughout this paragraph, the subscripts $i$ and $j$ will always have to run from 1 to $k$.

Suppose now that we dispose of a sample of $n$ observed points from this distribution. Let the $\nu$:th point of the sample be denoted by $(x_{1\nu}, x_{2\nu}, \ldots, x_{k\nu})$, where $\nu = 1, 2, \ldots, n$, and suppose $n > k$. We then calculate the moment characteristics of the first and second order for the sample. According to the general rules of 27.1, and the notations for the corresponding population moments introduced in 22.3, these will be denoted by

$$(29.9.2)\qquad \bar{x}_i = \frac{1}{n}\sum_{\nu=1}^n x_{i\nu}, \qquad l_{ij} = r_{ij}\,s_i s_j = \frac{1}{n}\sum_{\nu=1}^n (x_{i\nu} - \bar{x}_i)(x_{j\nu} - \bar{x}_j).$$

There are $k$ sample means $\bar{x}_i$ and $k$ variances $l_{ii} = s_i^2$. Further, since $l_{ji} = l_{ij}$, there are $\frac{1}{2}k(k-1)$ distinct covariances $l_{ij}$ with $i \neq j$. The total number of distinct variables $l_{ij}$ is thus $\frac{1}{2}k(k+1)$. The matrices $\boldsymbol{L} = \{l_{ij}\}$ and $\boldsymbol{R} = \{r_{ij}\}$ are the moment matrix and the correlation matrix of the sample, while the corresponding determinants are $L = |l_{ij}|$ and $R = |r_{ij}|$.

The joint distribution of all the variables $\bar{x}_i$ and $l_{ij}$ can now be found in the same way as the corresponding distribution in 29.6. In direct generalization of (29.6.2), we obtain for the joint c. f. of all these variables the expression

$$(29.9.3)\qquad \varphi = \frac{1}{(2\pi)^{\frac{kn}{2}}\Lambda^{\frac{n}{2}}}\int\cdots\int e^{\Omega}\,\prod_{i,\nu} dx_{i\nu}, \qquad \Omega = \boldsymbol{i}\sum_i t_i\bar{x}_i + \boldsymbol{i}\sum_{i,j}\varepsilon_{ij}t_{ij}l_{ij} - \frac{1}{2\Lambda}\sum_{\nu=1}^n\sum_{i,j}\Lambda_{ij}x_{i\nu}x_{j\nu},$$

where the integral is extended over the $kn$-dimensional space of the variables $x_{i\nu}$ ($i = 1, \ldots, k$; $\nu = 1, \ldots, n$), while as in 29.5 we write $\varepsilon_{ij} = 1$ for $i = j$, and $\varepsilon_{ij} = \frac{1}{2}$ for $i \neq j$. For every $i$, we now replace the set of $n$ variables $x_{i1}, \ldots, x_{in}$ by $n$ new variables $\xi_{i1}, \ldots, \xi_{in}$ by means of an orthogonal transformation such that $\xi_{i1} = \sqrt{n}\,\bar{x}_i$, using the same transformation matrix for all values of $i$. We then have for all $i$ and $j$

$$\sum_{\nu=1}^n x_{i\nu}x_{j\nu} = \sum_{\nu=1}^n \xi_{i\nu}\xi_{j\nu}, \qquad n\,l_{ij} = \sum_{\nu=1}^n x_{i\nu}x_{j\nu} - n\bar{x}_i\bar{x}_j = \sum_{\nu=2}^n \xi_{i\nu}\xi_{j\nu},$$

and hence
We then have for all i and j n H ^ Xi V Xjv = 2 v=l v—1 n w « = 2 ~ ” v=l r=2 and hence 404 39.9 (29.9.4) ^ < ij %se^ i,J ^ ' Introducing^ this expression of £2 in (29.9.3), the integnral may be evaluated in the same way as the corresponding integ^l in (29.6.2), and the joint c. f. (29.9.3) assumes the form (29.9.5) c'»‘»5'''‘''^-(f.)’**, where A and A* denote the determinants of the matrices and 1 A* j nAij [2A — iBij Thus in particular A^{\nY A""^, In the same way as in 29.6, the joint c. f. is a product of two factors, the first of which is the c. f. of a normal distribution, while the second is of the form (29.6.7), and thus corresponds to a distribution of the form (29.5.1), with A = and the matrix of variables X = L = {?o}. Denoting by S the set of all points in the ^^(i + l)-dimensional space of the variables Uj such that the symmetric matrix h is definite positive, we thus obtain the following generalization of the theorem of 29.6: The combined random variables . . ., Xk) and (In, Z,„ . . ., Zjbjb) are independent. The joint distribution of x^^ . . ., is normal, with the same first order moments as the parent distribution, and the moment matrix n^^A. The joint distribution of the \h[k •¥ 1) distinct variables lij has the fr.f fn given by (29.9.6) fn[ln.ln hk)^Ckn n—l 2 • n-fc-2 A ^ \fj lij for every point in the set S, while /n == 0 in the complementary set 5*. The constant Ckn is given by (29.5.2). 405 29.9-10 This theorem was first proved by Wishart (Ref. 240) by an ex- tension of the geometrical methods due to R. A. Fisher, and then by Wishart and Bartlett (Ref. 241) by the method of characteristic functions. We also refer to a paper by Simonsen (Ref. 213 a). 29.10. The generalized variance. — The determinant L = \lij\ re- presents the generalized variance of the sample (cf 22.7). Following Wilks (Ref. 232), we shall now indicate how the moments of L may be determined. For the explicit distribution of we refer to Kull- back (Ref. 143). 
The integral of the fr. f. $f_n$ in (29.9.6) over the set $S$ is obviously equal to 1, so that

$$\int\cdots\int_S L^{\frac{n-k-2}{2}}\, e^{-\frac{n}{2\Lambda}\sum \Lambda_{ij}l_{ij}}\, dl_{11} \ldots dl_{kk} = \frac{1}{C_{kn}}\left(\frac{2}{n}\right)^{\frac{k(n-1)}{2}}\Lambda^{\frac{n-1}{2}}.$$

Now the set $S$ is invariant under any transformation of the form $w_{ij} = a\,l_{ij}$, where $a > 0$, and the relation just written holds for all values of $n > k$ while the matrix $\{n\Lambda_{ij}/2\Lambda\}$ is kept fixed. Replacing $n$ by $n + 2\nu$ in the exponent of $L$ and in the constant, we thus obtain

$$\int\cdots\int_S L^{\nu}\, L^{\frac{n-k-2}{2}}\, e^{-\frac{n}{2\Lambda}\sum \Lambda_{ij}l_{ij}}\, dl_{11} \ldots dl_{kk} = \frac{1}{C_{k,n+2\nu}}\left(\frac{2}{n}\right)^{\frac{k(n+2\nu-1)}{2}}\Lambda^{\frac{n+2\nu-1}{2}}.$$

After multiplication with the constant factor of (29.9.6), this gives, taking account of (29.5.2),

$$E(L^\nu) = \left(\frac{2\Lambda}{n}\right)^{k\nu}\frac{C_{kn}}{C_{k,n+2\nu}} = \left(\frac{2}{n}\right)^{k\nu}\Lambda^\nu\,\prod_{i=1}^k \frac{\Gamma\!\left(\frac{n+2\nu-i}{2}\right)}{\Gamma\!\left(\frac{n-i}{2}\right)}$$

for $n + 2\nu > k$, i.e. for any $\nu > -\frac{1}{2}(n-k)$. In particular we have

$$E(L) = \frac{(n-1)(n-2)\cdots(n-k)}{n^k}\,\Lambda, \qquad D^2(L) = \frac{k\,(2n-k+1)\,(n-k)(n-k+1)\left[(n-1)(n-2)\cdots(n-k+2)\right]^2}{n^{2k}}\,\Lambda^2.$$

For a one-dimensional distribution ($k = 1$) we have $L = l_{11} = m_2$ and $\Lambda = \sigma^2$, and the above expression for $E(L^\nu)$ then reduces to the formula (29.3.2).

29.11. The generalized Student ratio. — Consider now a sample from a $k$-dimensional normal distribution with arbitrary mean values $m_1, \ldots, m_k$, and denote by $l'_{ij}$ the product moments about the population mean:

$$(29.11.1)\qquad l'_{ij} = \frac{1}{n}\sum_{\nu=1}^n (x_{i\nu} - m_i)(x_{j\nu} - m_j) = l_{ij} + (\bar{x}_i - m_i)(\bar{x}_j - m_j),$$

where the $\bar{x}_i$ and the $l_{ij}$ are given by (29.9.2). There are $\frac{1}{2}k(k+1)$ distinct variables $l'_{ij}$. If we write $z_{i\nu} = x_{i\nu} - m_i$, the joint c. f. of the $l'_{ij}$ becomes

$$\varphi = \frac{1}{(2\pi)^{\frac{kn}{2}}\Lambda^{\frac{n}{2}}}\int\cdots\int e^{\Omega}\,\prod_{i,\nu} dz_{i\nu}, \qquad \Omega = \boldsymbol{i}\sum_{i,j}\varepsilon_{ij}t_{ij}l'_{ij} - \frac{1}{2\Lambda}\sum_{\nu=1}^n\sum_{i,j}\Lambda_{ij}z_{i\nu}z_{j\nu}.$$

Comparing this with (29.9.3)–(29.9.5), we find that the c. f. of the $l'_{ij}$ is $\left(\frac{A}{A^*}\right)^{\frac{n}{2}}$, where $A$ and $A^*$ denote the same determinants as in (29.9.5). It follows that the joint fr. f. of the $l'_{ij}$ is obtained if, in (29.9.6), we replace $n$ by $n+1$, except in the two factors $\left(\frac{n}{2}\right)^{\frac{k(n-1)}{2}}$ and $e^{-\frac{n}{2\Lambda}\sum\Lambda_{ij}l_{ij}}$, which arise from the matrix $\boldsymbol{A}$. Writing $L' = |l'_{ij}|$, we then obtain by the same transformation as in the preceding paragraph

$$E(L'^\mu) = \left(\frac{2}{n}\right)^{k\mu}\Lambda^\mu\,\prod_{i=1}^k \frac{\Gamma\!\left(\frac{n+1+2\mu-i}{2}\right)}{\Gamma\!\left(\frac{n+1-i}{2}\right)}$$

for any $\mu > -\frac{1}{2}(n+1-k)$. — On the other hand, according to (29.11.1), $L'$ is a function of the random variables $l_{ij}$ and $\xi_i = \bar{x}_i - m_i$, and the joint fr. f. of all these variables is by the theorem of 29.9
• • •. fct) is firiven by (29.9.6). Thus we may also write E(LV) = jV>^g(%,l)d%dl, where the integral is extended over the set S (defined in 29.6) with respect to the hj, and over (— 00 , 00 ) with respect to every Here we may now apply once more the transformation of the preceding paragraph, writing Wij=^nUj and 171 — and then replacing n by n + 2v. Equating the two expressions of E{L*f*), we then obtain for any r > 0 and fi> — v — i(»-f 1 — i) E(L^L>) Taking this reduces to Thus by (18.4.4) the variable LjV has the same moments as the Beta- distribution with the fr. f . (29.11.2) /l(*;!!.^, I) (0<x< 1 ). Since a distribution with finite range is uniquely determined by its moments (cf 15.4), it follows that L/V has the fr. f. (29.11.2). On the other hand, we obtain from (29.11.1) 408 29.11 L' = L + 2 Lij(xi — — mj), 1 ^ 1 + 2 (^< ~ ”^) i^'j — »>j) u whore Ltj is the cofactor of hj in L, The quadratic form in the de- nominator is non-negative, since L is the moment matrix of a distri- bution, viz. the distribution of the sample. — If we now introduce a new variable T bj writing (29.11.3) r* = (m - 1) 2 x ij where T ^ 0, we have L 1 and by a simple transformation of (29.11.2) the fr. f. of T is found to be (29.11.4) [x > 0). For h = \, this reduces to the positive half of the ordinary Student distribution (18.2.4) with n — \ degrees of freedom. The distribution of T has been found by Hotelling (Ref. 126), and the above proof is due to Wilks (Ref. 232). Just as the ordinary Student ratio t may be used to test the signi- ficance of the deviation of an observed mean x from some hypothetical value m, the generalized Student ratio T provides a test of the joint deviation of the sample means iCj, . . Xk from some hypothetical system of values . . ., w*. In 29.4, we have shown how the Student ratio may be modified so as to provide a test of the difference between two mean values. An analogous modification may be applied to the generalized ratio T. 
Suppose that we are given two samples of $n_1$ and $n_2$ individuals respectively, drawn from the same $k$-dimensional normal population, and let $\bar{x}_i^{(1)}$, $\bar{x}_i^{(2)}$, $l_{ij}^{(1)}$ and $l_{ij}^{(2)}$ denote the means, variances and covariances of the two samples. Let further $\boldsymbol{H}$ denote the matrix

$$\boldsymbol{H} = \left\{n_1\,l_{ij}^{(1)} + n_2\,l_{ij}^{(2)}\right\} = n_1\boldsymbol{L}_1 + n_2\boldsymbol{L}_2,$$

while $H$ and $H_{ij}$ are the corresponding determinant and its cofactors. Writing

$$(29.11.5)\qquad U^2 = \frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}\sum_{i,j}\frac{H_{ij}}{H}\left(\bar{x}_i^{(1)} - \bar{x}_i^{(2)}\right)\left(\bar{x}_j^{(1)} - \bar{x}_j^{(2)}\right),$$

where $U \geq 0$, it can be shown by the same methods as above that $U$ has the fr. f. (29.11.4) with $n$ replaced by $n_1 + n_2 - 1$. The expression (29.11.5) is entirely free from the parameters of the parent distribution, so that $U$ can be directly calculated from a sample and used as a test of the joint divergence between the two systems $\bar{x}_i^{(1)}$ and $\bar{x}_i^{(2)}$ of sample means. For $k = 1$, it will be seen that $U^2$ is identical with $u^2$ as defined by (29.4.2).

29.12. Regression coefficients. — For a two-dimensional distribution we have seen that the variable (29.8.4), which is simply connected with a sample regression coefficient, has the $t$-distribution with $n-2$ d. of fr. This result has been generalized by Bartlett (Ref. 54) to distributions in any number of dimensions. Replacing in (23.2.3) and (23.4.5) the population characteristics by sample characteristics, we obtain for the regression coefficient $b_{12\cdot 34\ldots k}$ the expression

$$b_{12\cdot 34\ldots k} = r_{12\cdot 34\ldots k}\;\frac{s_{1\cdot 34\ldots k}}{s_{2\cdot 34\ldots k}},$$

where the residual variances $s^2$ may be calculated from the sample correlation coefficients $r$ as shown by the first relation (23.4.5). If $\beta_{12\cdot 34\ldots k}$ denotes the population value of the regression coefficient, the variable

$$(29.12.1)\qquad t = \sqrt{n-k}\;\frac{s_{2\cdot 34\ldots k}}{s_{1\cdot 23\ldots k}}\left(b_{12\cdot 34\ldots k} - \beta_{12\cdot 34\ldots k}\right)$$

has Student's distribution with $n-k$ d. of fr. In the same way as in the case of (29.8.4), we can thus obtain a test of significance for the deviation of the observed value $b$ of a regression coefficient from any hypothetical value $\beta$. (Cf 31.3, Ex. 7.)

29.13. Partial and multiple correlation coefficients.
— We now proceed to some further applications of the distribution (29.9.6), restricting ourselves to the particular case when the k variables in the normal parent distribution are independent. In this case λ_ij, ρ_ij and Λ_ij all reduce to zero for i ≠ j, so that the moment matrix Λ is a diagonal matrix, while the correlation matrix P is the unit matrix (cf 22.3).

In the joint distribution (29.9.6) of the l_ij, we replace the l_ij with i ≠ j by the sample correlation coefficients by means of the substitution l_ij = r_ij √(l_ii l_jj). We then have L = l₁₁ l₂₂ ⋯ l_kk R, where R = |r_ij| is the determinant of the correlation matrix R of the sample. The Jacobian of the transformation (cf the analogous transformation 29.5.3) is (l₁₁ ⋯ l_kk)^{(k−1)/2}, and the joint fr. f. of the variables l_ii and r_ij becomes by (29.9.6), in the particular case considered here,

c_kn ∏_{i=1}^{k} (l_ii^{(n−3)/2} e^{−n l_ii/(2λ_ii)} λ_ii^{−(n−1)/2}) · R^{(n−k−2)/2}

for l_ii > 0 and all values of the r_ij such that the matrix R is definite positive. For all other values of the variables, the fr. f. is zero. We can now directly integrate over (0, ∞) with respect to every l_ii. After introduction of the value (29.5.2) of c_kn, we obtain the joint fr. f. of the sample correlation coefficients r_ij:

(29.13.1)    (Γ((n−1)/2))^k / (π^{k(k−1)/4} ∏_{i=1}^{k} Γ((n−i)/2)) · R^{(n−k−2)/2}.

According to the terminology of Frisch (Ref. 113), the determinant R is the square of the scatter coefficient of the sample (cf 22.7). The moments of R may be determined by the method of 29.10. Denoting by B_kn the factor of R^{(n−k−2)/2} in (29.13.1), we find, e.g.,

(29.13.2)    E(R) = B_kn / B_{k,n+2} = (n−2)(n−3) ⋯ (n−k) / (n−1)^{k−1},

             D²(R) = k(k−1)/n² + O(n⁻³).

The partial correlation coefficient between the sample values of the variables x₁ and x₂, after elimination of the remaining variables x₃, x₄, …, x_k, is by (23.4.2)

(29.13.3)    r₁₂·₃₄…ₖ = −R₁₂ / √(R₁₁ R₂₂),

where the R_ij are the usual cofactors of R. In the particular case of an uncorrelated parent distribution considered here, the corresponding population value ρ₁₂·₃₄…ₖ
is, of course, equal to zero.

In order to find the distribution of r₁₂·₃₄…ₖ, we regard (29.13.3) as a substitution replacing r₁₂ by a new variable r₁₂·₃₄…ₖ, while all the r_ij except r₁₂ are retained as variables. R₁₁ and R₂₂ do not involve r₁₂, and thus (29.13.3) can be written, using notations analogous to those of 11.5,

r₁₂·₃₄…ₖ = (R₁₁·₂₂ / √(R₁₁ R₂₂)) r₁₂ + Q,

where Q does not involve r₁₂. This shows that there is a one-to-one correspondence between the two sets of variables. The Jacobian of the transformation is √(R₁₁ R₂₂) / R₁₁·₂₂. From (11.7.3) and (29.13.3) we further obtain

R = (R₁₁ R₂₂ / R₁₁·₂₂)(1 − r²₁₂·₃₄…ₖ).

Introducing the substitution (29.13.3) in (29.13.1), we thus find that the joint fr. f. of r₁₂·₃₄…ₖ and all r_ij other than r₁₂ is

C (1 − r²₁₂·₃₄…ₖ)^{(n−k−2)/2} (R₁₁ R₂₂)^{(n−k−1)/2} R₁₁·₂₂^{−(n−k)/2},

where C is a constant. This is the product of two factors, one of which depends only on r₁₂·₃₄…ₖ, while the other depends only on the remaining r_ij. Since the variable r₁₂·₃₄…ₖ obviously ranges over the whole interval (−1, 1), the multiplicative constant in its fr. f. is easily determined, and we have by (22.1.2) the following theorem:

The partial correlation coefficient r₁₂·₃₄…ₖ is independent of all the r_ij other than r₁₂, and has the fr. f.

(29.13.4)    Γ((n−k+1)/2) / (√π Γ((n−k)/2)) · (1 − x²)^{(n−k−2)/2}    (−1 < x < 1).

We observe that by (29.7.5) the total correlation coefficient r₁₂ has in the present case the fr. f.

Γ((n−1)/2) / (√π Γ((n−2)/2)) · (1 − x²)^{(n−4)/2}.

In order to pass from the distribution of r₁₂ to the distribution of r₁₂·₃₄…ₖ, we thus only have to replace n by n − (k − 2), i.e. to subtract from n the number of variables eliminated. R. A. Fisher (Ref. 93) has shown that this property subsists even in the general case when the variables in the parent distribution are not independent. In the case of independence, it follows (cf 29.7) that the variable t = √(n−k) r/√(1−r²), where r = r₁₂·₃₄…ₖ, has Student's distribution with n − k d. of fr. Consequently the inequality

(29.13.5)    |r| > t_p / √(n − k + t_p²),

where t_p is the p % value of t for n − k d.
of fr., has the probability p %. (Cf 31.3, Ex. 7.)

The multiple correlation coefficient r₁(₂…ₖ) between the sample values of x₁ and (x₂, …, x_k) is, by (23.5.2), the non-negative square root

(29.13.6)    r²₁(₂…ₖ) = 1 − R/R₁₁.

The corresponding population value ρ₁(₂…ₖ) is, in the present case of an uncorrelated normal parent distribution, equal to zero. We now propose to find the distribution of r₁(₂…ₖ).

In the joint distribution (29.13.1) of the r_ij, we replace the k − 1 variables r₁₂, …, r₁ₖ by the k − 1 new variables r = r₁(₂…ₖ) and z₃, …, z_k, by means of the relations (29.13.6) and r₁ᵢ = zᵢ r (i = 2, 3, …, k). Between the new variables, we then have by (11.5.3) a relation by which one of the zᵢ, say z₂, may be expressed as a function of the other zᵢ and the r_ij with i > 1 and j > 1. The Jacobian of this transformation is of the form r^{k−2} Q′, where Q′ does not involve r. Further, we obtain from (29.13.6) R = R₁₁(1 − r²), and thus the introduction of the above substitution in (29.13.1) yields an expression of the form

C r^{k−2} (1 − r²)^{(n−k−2)/2} Q″

for the joint fr. f. of the new variables, where Q″ does not involve r. Thus the multiple correlation coefficient r₁(₂…ₖ) is independent of all the r_ij with i > 1, j > 1, and has the fr. f.

(29.13.7)    2 Γ((n−1)/2) / (Γ((k−1)/2) Γ((n−k)/2)) · x^{k−2} (1 − x²)^{(n−k−2)/2}    (0 < x < 1).

The square r² has the Beta-distribution with the fr. f.

(29.13.8)    β(x; (k−1)/2, (n−k)/2) = Γ((n−1)/2) / (Γ((k−1)/2) Γ((n−k)/2)) · x^{(k−3)/2} (1 − x)^{(n−k−2)/2}.

The distribution of r was found by R. A. Fisher (Ref. 94), who also (Ref. 98) solved the more general problem of finding this distribution in the case of an arbitrary normal parent distribution. In this general case, the fr. f. of r may be expressed as the product of the function (29.13.7) with a power series containing the population value ρ₁(₂…ₖ), in a similar way as in the case of the ordinary correlation coefficient (cf 29.7.1).

Let us finally consider the behaviour of the distribution of r² for large values of n. The variable n r² has the fr.
f.

n⁻¹ β(x/n; (k−1)/2, (n−k)/2).

When n → ∞, this tends to the limit

(29.13.9)    1 / (2^{(k−1)/2} Γ((k−1)/2)) · x^{(k−3)/2} e^{−x/2},

which is the fr. f. of a χ²-distribution with k − 1 d. of fr. (cf 31.3, Ex. 7). Thus the distribution of r² does not tend to normality as n → ∞. Accordingly, we obtain from (29.13.8)

E(r²) = (k−1)/(n−1),    D²(r²) = 2(k−1)(n−k) / ((n−1)²(n+1)),

so that we have here an instance of the exceptional case mentioned at the end of 28.4, where the variance is of a smaller order than n⁻¹, and the theorem on the convergence to normality breaks down. This takes place, however, only in the case considered here, when the population value ρ is equal to zero. When ρ ≠ 0, the variance of r² is of order n⁻¹, and the distribution approaches normality as n → ∞.

Chapters 30–31. Tests of Significance, I.

CHAPTER 30.

Tests of Goodness of Fit and Allied Tests.

30.1. The χ² test in the case of a completely specified hypothetical distribution. — We now proceed to study the problem of testing the agreement between probability theory and actual observations. In the present paragraph, we shall consider the situation indicated in 26.2, when a sample of n observed values of some variable (in any number of dimensions) is given, and we want to know if this variable can be reasonably regarded as a random variable having a given probability distribution.

Let us denote as hypothesis H the hypothesis that our data form a sample of n values of a random variable with the given pr. f. P(S). We assume here that P(S) is completely specified, so that no unknown parameter appears in its expression, and the probability P(S) may be numerically calculated for any given set S. It is then required to work out a method for testing whether our data may be regarded as consistent with the hypothesis H.
If the hypothesis H is true, the distribution of the sample (cf 25.3), which is the simple discrete distribution obtained by placing the mass 1/n in each of the n observed points, may be regarded as a statistical image (cf 25.5) of the parent distribution specified by P(S). Owing to random fluctuations, the two distributions will as a rule not coincide, but for large values of n the distribution of the sample may be expected to form an approximation to the parent distribution. As already indicated in 26.2, it then seems natural to introduce some measure of the deviation between the two distributions, and to base our test on the properties of the sampling distribution of this measure.

Such deviation measures may be constructed in various ways, the most generally used being that connected with the important χ² test introduced by K. Pearson (Ref. 183). Suppose that the space of the variable is divided into a finite number r of parts S₁, …, S_r without common points, and let the corresponding values of the given pr. f. be p_i, so that p_i = P(S_i) and Σ_{i=1}^{r} p_i = 1. We assume that all the p_i are > 0. The r parts S_i may, e.g., be the r groups into which our sample values have been arranged for tabulation purposes. Let the corresponding group frequencies in the sample be ν₁, …, ν_r, so that ν_i sample values belong to the set S_i, and we have Σ_{i=1}^{r} ν_i = n.

Our first object is now to find a convenient measure of the deviation of the distribution of the sample from the hypothetical distribution. Any set S_i carries the mass ν_i/n in the former distribution, and the mass p_i in the latter. It will then be in conformity with the general principle of least squares (cf 15.6) to adopt as measure of deviation an expression of the form Σ_i c_i (ν_i/n − p_i)², where the coefficients c_i may be chosen more or less arbitrarily. It was shown by K. Pearson that if we take c_i = n/p_i, we shall obtain a deviation measure with particularly simple properties.
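The effect of Pearson's choice of weights can be seen in a few lines; the frequencies and probabilities below are an assumed illustration:

```python
# Sketch: the least-squares deviation measure with Pearson's weights
# c_i = n / p_i.  Frequencies and probabilities are assumed for illustration.
nu = [18, 31, 51]                 # observed group frequencies, summing to n
p = [0.2, 0.3, 0.5]               # hypothetical group probabilities
n = sum(nu)

c = [n / q for q in p]            # Pearson's weights c_i = n / p_i
measure = sum(ci * (v / n - q) ** 2 for ci, v, q in zip(c, nu, p))

# the same quantity in the (nu_i - n p_i)^2 / (n p_i) form
chi2 = sum((v - n * q) ** 2 / (n * q) for v, q in zip(nu, p))
print(measure, chi2)              # the two numbers agree
```

The agreement is an algebraic identity: (n/p_i)(ν_i/n − p_i)² = (ν_i − n p_i)²/(n p_i) for every group.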
We obtain in this way the expression

χ² = Σ_{i=1}^{r} (n/p_i)(ν_i/n − p_i)² = Σ_{i=1}^{r} (ν_i − n p_i)² / (n p_i).

Thus χ² is simply expressed in terms of the observed frequencies ν_i and the expected frequencies n p_i for all r groups. We shall now investigate the sampling distribution of χ², assuming throughout that the hypothesis H is true. It will be shown that we have

(30.1.1)    E(χ²) = r − 1,    D²(χ²) = 2(r − 1) + (Σ_{i=1}^{r} 1/p_i − r² − 2r + 2) / n.

We shall further prove the following theorem due to K. Pearson (Ref. 183), which shows that, as the size of the sample increases, the sampling distribution of χ² tends to a limiting distribution completely independent of the hypothetical pr. f. P(S).

As n → ∞, the sampling distribution of χ² tends to the distribution defined by the fr. f.

(30.1.2)    k_{r−1}(x) = 1 / (2^{(r−1)/2} Γ((r−1)/2)) · x^{(r−3)/2} e^{−x/2}    (x > 0)

studied in 18.1. — Using the terminology introduced in 18.1 and 29.2, we may thus say that, in the limit, χ² is distributed in a χ²-distribution with r − 1 degrees of freedom.

At each of the n observations leading to the n observed points in our sample, we have the probability p_i to obtain a result belonging to the set S_i. For any set of non-negative integers ν₁, …, ν_r such that Σ ν_i = n, the probability that, in the course of n observations, we shall exactly ν_i times obtain a result belonging to S_i, for i = 1, …, r, is then (cf Ex. 9, p. 318)

n! / (ν₁! ⋯ ν_r!) · p₁^{ν₁} ⋯ p_r^{ν_r},

which is the general term of the expansion of (p₁ + ⋯ + p_r)ⁿ. Thus the joint distribution of the r group frequencies ν₁, …, ν_r is a simple generalization of the binomial distribution, which is known as the multinomial distribution. The joint c. f. of the variables ν₁, …, ν_r is

(p₁ e^{it₁} + ⋯ + p_r e^{it_r})ⁿ,

as may be directly shown by a straightforward generalization of the proof of the corresponding expression (16.2.3) in the binomial case. Writing

(30.1.3)    x_i = (ν_i − n p_i) / √(n p_i)    (i = 1, 2, …, r),

it is seen that the x_i satisfy the identity Σ_{i=1}^{r} x_i √(p_i) = 0, and that we have χ² = Σ_{i=1}^{r} x_i². Further, the joint c. f.
of the variables x₁, …, x_r is

φ(t₁, …, t_r) = e^{−i√n Σ_j t_j √(p_j)} (p₁ e^{it₁/√(np₁)} + ⋯ + p_r e^{it_r/√(np_r)})ⁿ.

From the MacLaurin expansion of this function, we deduce by some easy calculation the expressions (30.1.1). We further find for any fixed t₁, …, t_r

log φ(t₁, …, t_r) = n log (1 + Σ_j p_j (i t_j/√(np_j) − t_j²/(2np_j)) + o(n⁻¹)) − i√n Σ_j t_j √(p_j),

so that the c. f. tends to the limit

lim_{n→∞} φ(t₁, …, t_r) = e^{−Q/2},  where  Q(t₁, …, t_r) = Σ_{j=1}^{r} t_j² − (Σ_{j=1}^{r} t_j √(p_j))².

The quadratic form Q has the matrix Λ = I − p p′, where I denotes the unit matrix (cf 11.1), while p denotes the column vector (cf 11.2) p = (√p₁, …, √p_r). Replacing t₁, …, t_r by new variables u₁, …, u_r by means of an orthogonal transformation such that u_r = Σ_j t_j √(p_j), we obtain (cf 11.11)

Q(t₁, …, t_r) = Σ_{j=1}^{r−1} u_j².

It follows that Q(t₁, …, t_r) is non-negative and of rank r − 1 (cf 11.6), and that the matrix Λ has r − 1 characteristic numbers (cf 11.9) equal to 1, while the r:th characteristic number is zero. As n → ∞, the joint c. f. of the variables x₁, …, x_r thus tends to the expression e^{−Q/2}, which is the c. f. of a singular normal distribution (cf 24.3) of rank r − 1, the total mass of which is situated in the hyperplane Σ x_i √(p_i) = 0. By the continuity theorem 10.7 it then follows that, in the limit, x₁, …, x_r are distributed in this singular normal distribution, with zero means and the moment matrix Λ. It then follows from 24.5 that, in the limit, the variable χ² = Σ_{i=1}^{r} x_i² is distributed in a χ²-distribution with r − 1 d. of fr. Thus the theorem is proved.

By means of this theorem, we can now introduce a test of the hypothesis H considered above. Let χ_p² denote the p % value of χ² for r − 1 d. of fr. (cf 18.1 and Table 3). Then by the above theorem the probability P = P(χ² > χ_p²) will for large n be approximately equal to p %. Suppose now that we have fixed p so small that we agree to regard it as practically certain that an event of probability p % will not occur in one single trial (cf 26.2). Suppose further that n is so large that, for practical purposes, the probability P may be identified with its limiting value p %.
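The structure of the limiting quadratic form can be confirmed directly: the matrix Λ = I − pp′ is idempotent with trace r − 1, so its characteristic numbers are r − 1 ones and a single zero. A small sketch (the group probabilities are an assumed example):

```python
import math

p_groups = [0.2, 0.3, 0.5]                    # assumed group probabilities, sum = 1
r = len(p_groups)
p = [math.sqrt(q) for q in p_groups]          # the vector p = (sqrt(p_1), ..., sqrt(p_r))

# Lambda = I - p p'
A = [[(1.0 if i == j else 0.0) - p[i] * p[j] for j in range(r)] for i in range(r)]

# idempotence A^2 = A implies every characteristic number is 0 or 1,
# and the trace then counts the ones, i.e. the rank
A2 = [[sum(A[i][m] * A[m][j] for m in range(r)) for j in range(r)] for i in range(r)]
err = max(abs(A2[i][j] - A[i][j]) for i in range(r) for j in range(r))
trace = sum(A[i][i] for i in range(r))
print(err, trace)                             # err ~ 0, trace = r - 1
```

Idempotence follows from p′p = Σ p_i = 1: (I − pp′)² = I − 2pp′ + p(p′p)p′ = I − pp′.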
If the hypothesis H is true, it is then practically excluded that, in one single sample, we should encounter a value of χ² exceeding χ_p². If, in an actual sample, we find a value χ² > χ_p², we shall accordingly say that our sample shows a significant deviation from the hypothesis H, and we shall reject this hypothesis, at least until further data are available. The probability that this situation will occur in a case when H is actually true, so that H will be falsely rejected, is precisely the probability P = P(χ² > χ_p²), which is approximately equal to p %. We shall then say that we are working on a p % level of significance.

If, on the other hand, we find a value χ² ≤ χ_p², the result will be regarded as consistent with the hypothesis H. Obviously one isolated result of this kind cannot be considered as sufficient evidence of the truth of the hypothesis. In order to produce such evidence, we shall have to apply the test repeatedly to new data of a similar character. Whenever possible, other tests should also be applied.

When the χ² test is applied in practice, and all the expected frequencies n p_i are ≥ 10, the limiting χ²-distribution tabulated in Table 3 gives as a rule the value χ_p² corresponding to a given P = p/100 with an approximation sufficient for ordinary purposes. If some of the n p_i are < 10, it is usually advisable to pool the smaller groups, so that every group contains at least 10 expected observations, before the test is applied. When the observations are so few that this cannot be done, the χ² tables should not be used, but some information may still be drawn from the values of E(χ²) and D(χ²) calculated according to (30.1.1). Table 3 is only applicable when the number of d. of fr. is ≤ 30. For more than 30 d. of fr., it is usually sufficient to use Fisher's proposition (cf 20.2) that √(2χ²) for n d. of fr. is approximately normally distributed, with the mean √(2n − 1) and unit s. d.

30.2. Examples.
— In practical applications of various tests of significance, the 5 %, 1 % and 0.1 % levels of significance are often used. Which level we should adopt in a given case will, of course, depend on the particular circumstances of the case. In the numerical examples that will be given in this book, we shall denote a value exceeding the 5 % limit but not the 1 % limit as almost significant, a value between the 1 % and 0.1 % limits as significant, and a value exceeding the 0.1 % limit as highly significant. This terminology is, of course, purely conventional.

Ex. 1. In a sequence of n independent trials, the event E has occurred ν times. Are these data consistent with the hypothesis that E has in every trial the given probability p = 1 − q?

The data may be regarded as a sample of n values of a variable which is equal to 1 or 0 according as E occurs or not. The hypothesis H consists in the assertion that the two alternatives have fixed probabilities p and q. Thus we have two groups with the observed frequencies ν and n − ν, and the corresponding expected frequencies np and nq. Hence we obtain

(30.2.1)    χ² = (ν − np)²/(np) + (ν − np)²/(nq) = (ν − np)²/(npq).

By the theorem of the preceding paragraph, this quantity is for large n approximately distributed in a χ²-distribution with one d. of fr. This agrees with the fact (cf 16.4 and 18.1) that the standardized variable (ν − np)/√(npq) is asymptotically normal (0, 1), so that its square has, in the limit, the fr. f. k₁(x). Accordingly, the percentage values of χ² for one d. of fr. given in Table 3 are the squares of the corresponding values for the normal distribution given in Table 2.

In n = 4040 throws with a coin, Buffon obtained ν = 2048 heads and n − ν = 1992 tails. Is this consistent with the hypothesis that there is a constant probability p = ½ of throwing heads? — We have here χ² = (ν − np)²/(npq) = 0.776, and this falls well below the 5 % value of χ² for one d.
of fr., which by Table 3 is 3.841, so that the data must be regarded as consistent with the hypothesis. The corresponding value of P = P(χ² ≥ 0.776) is about 0.38, which means that we have a probability of about 38 % of obtaining a deviation from the expected result at least as great as that actually observed.

Ex. 2. Suppose now that k independent sets of observations are available, and let these contain n₁, …, n_k observations respectively, the corresponding numbers of occurrences of the event E being ν₁, …, ν_k. The hypothesis of a constant probability equal to p may then be tested in various ways. The totality of our data consist of n = Σ n_i observations with ν = Σ ν_i occurrences, so that we obtain a first test by calculating the quantity χ² = (ν − np)²/(npq). Further, the quantity χ_i² = (ν_i − n_i p)²/(n_i pq) provides a separate test for the i:th set of observations. Then χ₁², …, χ_k² are independent, and for large n_i all have asymptotically the same distribution, viz. the χ²-distribution with one d. of fr. By the addition theorem (18.1.7) the sum Σ_{i=1}^{k} χ_i² has, in the limit, a χ²-distribution with k d. of fr., and this gives a joint test of all our χ²-values. Finally, when the n_i are large, χ₁², …, χ_k² may be regarded as a sample of k observed values of a variable with the fr. f. k₁(x), and we may apply the χ² test to judge the deviation of the sample from this hypothetical distribution.

In his classical experiments with peas, Mendel (Ref. 155) obtained from 10 plants the numbers of green and yellow peas given in Table 30.2.1. According to Mendelian theory, the probability ought to be p = ¾ for »yellow», and q = ¼ for »green» (the »3:1 hypothesis»). The ten values of χ_i², as well as the value χ² = 0.137 for the totals, all fall below the 5 % value for one d. of fr. The sum of all ten χ_i² is 7.191, and this falls below the 5 % value for ten d. of fr., which by Table 3 is 18.307. Finally, the ten values of χ_i² may be regarded as a sample of ten values of a variable with the fr. f. k₁(x).
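Both the Buffon computation of Ex. 1 and the moment formulas (30.1.1) are easy to reproduce; the enumeration below checks (30.1.1) exactly in the two-group case (the n and p of the check are assumed illustrative values):

```python
import math

# Buffon's coin data: chi-square for two groups, (30.2.1)
n_b, nu_b, p_b = 4040, 2048, 0.5
chi2_buffon = (nu_b - n_b * p_b) ** 2 / (n_b * p_b * (1 - p_b))

def chi2_moments(p, n):
    # (30.1.1): E = r - 1,  D^2 = 2(r - 1) + (sum 1/p_i - r^2 - 2r + 2) / n
    r = len(p)
    return r - 1, 2 * (r - 1) + (sum(1 / q for q in p) - r * r - 2 * r + 2) / n

# exact check of (30.1.1) for r = 2 by enumerating the binomial distribution
n, p = 10, 0.3
probs = [math.comb(n, v) * p ** v * (1 - p) ** (n - v) for v in range(n + 1)]
vals = [(v - n * p) ** 2 / (n * p * (1 - p)) for v in range(n + 1)]
E = sum(pr * x for pr, x in zip(probs, vals))
V = sum(pr * (x - E) ** 2 for pr, x in zip(probs, vals))
print(round(chi2_buffon, 3), E, V, chi2_moments([p, 1 - p], n))
```

The enumerated mean is exactly 1 = r − 1, and the enumerated variance coincides with the formula of (30.1.1), even at this small n where the limiting χ²-distribution itself would be a poor approximation.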
For this distribution, we obtain from Table 3 the following probabilities: P(0 < χ² < 0.148) = 0.3, P(0.148 < χ² < 1.074) = 0.4, P(χ² > 1.074) = 0.3, while according to the last column of Table 30.2.1 the corresponding observed frequencies are respectively 2, 6 and 2. The calculation of χ² for this sample of n = 10 observations with r = 3 groups gives χ² = (2 − 3)²/3 + (6 − 4)²/4 + (2 − 3)²/3 = 1.667. In this case, the expected values are so small that the limiting distribution should not be used, but we may compare the observed value χ² = 1.667 with the values E(χ²) = 2 and D(χ²) = 1.902 calculated from (30.1.1). Since the observed value only differs from the mean by about 18 % of the s. d., the agreement must be regarded as good.

Table 30.2.1.

Plant number i | Yellow | Green | Total n_i | χ_i²
 1 | 25 | 11 | 36 | 0.593
 2 | 32 |  7 | 39 | 1.034
 3 | 14 |  5 | 19 | 0.018
 4 | 70 | 27 | 97 | 0.416
 5 | 24 | 13 | 37 | 2.027
 6 | 20 |  6 | 26 | 0.051
 7 | 32 | 13 | 45 | 0.363
 8 | 44 |  9 | 53 | 1.818
 9 | 50 | 14 | 64 | 0.333
10 | 44 | 18 | 62 | 0.538
Total | 355 | 123 | 478 | 7.191
χ² for the totals = 0.137

Thus all our tests imply that the data of Table 30.2.1 are consistent with the 3:1 hypothesis. If either test had disclosed a significant deviation, we should have had to reject the hypothesis, at least until further experience had made it plausible that the deviation was due to random fluctuations.

Ex. 3. In another experiment, Mendel observed simultaneously the shape and the colour of his peas. Among n = 556 peas he obtained:

Round and yellow 315, (expected 312.75),
Round and green 108, ( » 104.25),
Angular and yellow 101, ( » 104.25),
Angular and green 32, ( » 34.75),

where the expected numbers are calculated on the hypothesis that the probabilities of the r = 4 groups are in the ratios 9 : 3 : 3 : 1. From these numbers we find χ² = 0.470. We have r − 1 = 3 d. of fr., and by Table 3 the probability of a χ² exceeding 0.470 lies between 90 and 95 %, so that the agreement is very good.

Ex. 4.
We finally consider an example where the hypothetical distribution is of the continuous type. Aitken (Ref. 2, p. 49) gives the following distributions of times shown by two samples of 500 watches displayed in watchmakers' windows (hour 0 means 0–1, etc.):

Table 30.2.2.

On the hypothesis that the times are uniformly distributed over the interval (0, 12), the expected number in each class would be 500/12 = 41.67, and hence we find χ₁² = 10.000 for the first sample, and χ₂² = 8.082 for the second, while for the combined sample of all 1 000 watches we have χ² = 9.464. In each case we have 12 − 1 = 11 d. of fr., and by Table 3 the agreement is good. We may also consider the sum χ₁² + χ₂² = 18.082, which has 22 d. of fr., and also shows a good agreement.

30.3. The χ² test when certain parameters are estimated from the sample. — The case of a completely specified hypothetical distribution is rather exceptional in the applications. More often we encounter cases where the hypothetical distribution contains a certain number of unknown parameters, about the values of which we only possess such information as may be derived from the sample itself.

We are then given a pr. f. P(S; α₁, …, α_s) containing s unknown parameters α₁, …, α_s, but otherwise of known mathematical form. The hypothesis H to be tested will now be the hypothesis that our sample has been drawn from a population having a distribution determined by the pr. f. P, with some values of the parameters α_j. As in 30.1, we suppose that our sample is divided into r groups, corresponding to r mutually exclusive sets S₁, …, S_r, and we denote the observed group frequencies by ν₁, …, ν_r, while the corresponding probabilities are p_i(α₁, …, α_s) = P(S_i; α₁, …, α_s) for i = 1, 2, …, r. If the »true values» of the α_j were known, we should merely have to calculate the quantity

(30.3.1)    χ² = Σ_{i=1}^{r} (ν_i − n p_i(α₁, …, α_s))² / (n p_i(α₁, …, α_s))
and apply the test described in 30.1, so that no further discussion would be required. In the actual case, however, the values of the α_j are unknown and must be estimated from the sample.

Now, if we replace in (30.3.1) the unknown constants α_j by estimates calculated from the sample, the p_i will no longer be constants, but functions of the sample values, and we are no longer entitled to apply the theorem of 30.1 on the limiting distribution of χ². As already pointed out in 26.4, there will generally be an infinite number of different possible methods of estimation of the α_j, and it must be expected that the properties of the sampling distribution of χ² will more or less depend on the method chosen.

The problem of finding the limiting distribution of χ² under these more complicated circumstances was first considered by R. A. Fisher (Ref. 91, 95), who showed that in this case it is necessary to modify the limiting distribution (30.1.2) due to K. Pearson. For an important class of methods of estimation, the modification indicated by Fisher is of a very simple kind. It is, in fact, only necessary to reduce the number of d. of fr. of the limiting distribution (30.1.2) by one unit for each parameter estimated from the sample.

We shall here choose one particularly important method of estimation, and give a detailed deduction of the corresponding limiting distribution of χ². It will be shown in 33.4 that there is a whole class of methods of estimation leading to the same limiting distribution.

It seems natural to attempt to determine the »best» values of the parameters α_j so as to render χ² defined by (30.3.1) as small as possible. This is the χ² minimum method of estimation. We then have to solve the equations

(30.3.2)    ∂χ²/∂α_j = −Σ_{i=1}^{r} (2(ν_i − n p_i)/p_i + (ν_i − n p_i)²/(n p_i²)) ∂p_i/∂α_j = 0,

where j = 1, 2, …, s, with respect to the unknowns α₁, …, α_s, and insert the values thus found into (30.3.1). The limiting distribution of χ² for this method of estimation has been investigated by Neyman and E.
S. Pearson (Ref. 170), who used methods of multi-dimensional geometry of the type introduced by R. A. Fisher. We also refer in this connection to a paper by Sheppard (Ref. 213).

Even in simple cases, the system (30.3.2) is often very difficult to solve. It can, however, be shown that for large n the influence of the second term within the brackets becomes negligible. If, when differentiating with respect to the α_j, we simply regard the denominators in the second member of (30.3.1) as constant, (30.3.2) is replaced by the system

(30.3.3)    Σ_{i=1}^{r} (ν_i − n p_i)/p_i · ∂p_i/∂α_j = 0    (j = 1, 2, …, s),

and usually this will be much easier to deal with. The method of estimation which consists in determining the α_j from this system of equations will be called the modified χ² minimum method. Both methods give, under general conditions, the same limiting distribution of χ² for large n, but we shall here only consider the simpler method based on (30.3.3).

By means of the condition a) of the theorem given below, the equations (30.3.3) reduce to

(30.3.3 a)    Σ_{i=1}^{r} (ν_i/p_i) ∂p_i/∂α_j = 0,

which may also be written ∂ log L/∂α_j = 0, where L = p₁^{ν₁} p₂^{ν₂} ⋯ p_r^{ν_r}. The method of estimation which consists in determining the α_j such that L becomes as large as possible is the maximum likelihood method introduced by R. A. Fisher, which will be further discussed in Ch. 33. With respect to the problem treated in the present paragraph, the modified χ² minimum method is thus identical with the maximum likelihood method. The latter method is, however, applicable also to problems of a much more general character.

On account of the importance of the question, we shall now give a deduction of the limiting distribution of χ² under as general conditions as possible, assuming that the parameters α_j are estimated by the modified χ² minimum method. We first give a detailed statement of the theorem to be proved.

Suppose that we are given r functions p₁(α₁, …, α_s), …, p_r(α₁, …, α_s) of s < r variables α₁, …, α_s,
such that, for all points of a non-degenerate interval A in the s-dimensional space of the α_j, the p_i satisfy the following conditions:

a) Σ_{i=1}^{r} p_i(α₁, …, α_s) = 1.

b) p_i(α₁, …, α_s) > c² > 0 for all i.

c) Every p_i has continuous derivatives ∂p_i/∂α_j and ∂²p_i/(∂α_j ∂α_k).

d) The matrix D = {∂p_i/∂α_j}, where i = 1, …, r and j = 1, …, s, is of rank s.

Let the possible results of a certain random experiment E be divided into r mutually exclusive groups, and suppose that the probability of obtaining a result belonging to the i:th group is p_i = p_i(α₁⁰, …, α_s⁰), where α₀ = (α₁⁰, …, α_s⁰) is an inner point of the interval A. Let ν_i denote the number of results belonging to the i:th group, which occur in a sequence of n repetitions of E, so that Σ_{i=1}^{r} ν_i = n.

The equations (30.3.3) of the modified χ² minimum method then have exactly one system of solutions α = (α₁, …, α_s) such that α converges in probability to α₀ as n → ∞. The value of χ² obtained by inserting these values of the α_j into (30.3.1) is, in the limit as n → ∞, distributed in a χ²-distribution with r − s − 1 degrees of freedom.

The proof of this theorem is somewhat intricate, and will be divided into two parts. In the first part it will be shown that the equations (30.3.3) have exactly one solution α such that α converges in probability (cf 20.3) to α₀. In the second part we consider the variables

(30.3.4)    y_i = (ν_i − n p_i(α₁, …, α_s)) / √(n p_i(α₁, …, α_s))    (i = 1, …, r),

where α = (α₁, …, α_s) is the solution of (30.3.3), the existence of which has just been established. It will be shown here that, as n → ∞, the joint distribution of the y_i tends to a certain singular normal distribution of a type similar to the limiting distribution of the variables x_i defined by (30.1.3). As in the corresponding proof in 30.1, the limiting distribution of χ² = Σ_{i=1}^{r} y_i² is then directly obtained from 24.5.

Throughout the proof, the subscript i will assume the values 1, 2, …, r, while j and k assume the values 1, 2, …, s.
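The theorem may be illustrated on a one-parameter example. The probabilities p₁ = α², p₂ = 2α(1 − α), p₃ = (1 − α)² (a Hardy–Weinberg-type model, assumed here for illustration) satisfy the conditions with r = 3, s = 1, so the resulting χ² is referred to r − s − 1 = 1 d. of fr.; the estimate below solves the equations (30.3.3 a) exactly.

```python
# Sketch: modified chi-square minimum / maximum likelihood estimation for
# p1 = a^2, p2 = 2a(1-a), p3 = (1-a)^2 (an assumed one-parameter model).
nu = [30, 50, 20]                       # illustrative group frequencies
n = sum(nu)

# the likelihood equation sum_i (nu_i / p_i) dp_i/da = 0 has the closed-form root
a = (2 * nu[0] + nu[1]) / (2.0 * n)

p = [a * a, 2 * a * (1 - a), (1 - a) ** 2]
dp = [2 * a, 2 - 4 * a, -2 * (1 - a)]   # derivatives dp_i/da, summing to zero

score = sum(v / q * d for v, q, d in zip(nu, p, dp))   # left side of (30.3.3 a)
chi2 = sum((v - n * q) ** 2 / (n * q) for v, q in zip(nu, p))
print(a, score, chi2)                   # score ~ 0; chi2 referred to 1 d. of fr.
```

With these counts the estimate is a = 0.55 and the fitted χ² is far below the 5 % value 3.841 for one d. of fr., so the assumed data are consistent with the model.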
We shall first introduce certain matrix notations, and transform the equations (30.3.3) into matrix form. Denoting by (∂p_i/∂α_j)₀ the value assumed by ∂p_i/∂α_j in the point α₀, and developing (30.3.3) about this point, the equations may be written

(30.3.5)    Σ_{k=1}^{s} (Σ_{i=1}^{r} (1/p_i⁰)(∂p_i/∂α_j)₀ (∂p_i/∂α_k)₀)(α_k − α_k⁰) = n^{−1/2} Σ_{i=1}^{r} x_i (1/√(p_i⁰))(∂p_i/∂α_j)₀ + ω_j(α),

where

(30.3.6)    ω_j(α) denotes the sum of all terms of higher order in the quantities α_k − α_k⁰ and n^{−1/2}(ν_i − n p_i⁰).

Let us denote by B the matrix of order r·s

B = {(1/√(p_i⁰))(∂p_i/∂α_j)₀},    i = 1, …, r;  j = 1, …, s.

By 11.1, we have B = P₀D₀, where P₀ is the diagonal matrix formed by the diagonal elements 1/√(p₁⁰), …, 1/√(p_r⁰), while D₀ is the matrix obtained by taking α_j = α_j⁰ in the matrix D = {∂p_i/∂α_j}. Thus by condition d) the matrix B is of rank s (cf 11.6). — We further write in analogy with (30.1.3)

(30.3.7)    x_i = (ν_i − n p_i⁰) / √(n p_i⁰),

and denote by α, α₀, ω(α) and x the column vectors (cf 11.2)

α = (α₁, …, α_s),  α₀ = (α₁⁰, …, α_s⁰),  ω(α) = (ω₁(α), …, ω_s(α)),  x = (x₁, …, x_r),

the three first of which are, as matrices, of order s·1, while the fourth is of order r·1.

In matrix notation, the system of equations (30.3.5), where j = 1, …, s, may now be written (cf 11.3)

B′B(α − α₀) = n^{−1/2} B′x + ω(α).

B′B is a symmetric matrix of order s·s, which according to 11.9 is non-singular, so that the reciprocal (B′B)⁻¹ exists (cf 11.7), and we obtain¹)

(30.3.8)    α = α₀ + n^{−1/2}(B′B)⁻¹B′x + (B′B)⁻¹ω(α).

This matrix equation is thus equivalent to the fundamental system of equations (30.3.3).

For every fixed i the random variable ν_i has the mean n p_i⁰ and the s. d. √(n p_i⁰(1 − p_i⁰)), so that by the Bienaymé-Tchebycheff inequality (15.7.2) the probability of the relation |ν_i − n p_i⁰| ≥ λ√n is at most equal to p_i⁰(1 − p_i⁰)/λ² ≤ 1/(4λ²). Consequently the probability that we have |ν_i − n p_i⁰| ≥ λ√n for at least one value of i is smaller than r/(4λ²) and, conversely, with a probability greater than 1 − r/(4λ²) we have

(30.3.9)    |ν_i − n p_i⁰| < λ√n    for all i = 1, …, r.

Until further notice, we shall now assume that the ν_i satisfy the relations (30.3.9). We shall here allow λ to denote a function of n such that λ tends to infinity with n, while λ²/√n tends to zero. We take λ = n^g, where 0 < g < ¼.
— All results obtained under such assumptions will thus be true with a probability which is greater than 1 − r/(4λ²), and which consequently tends to 1 as n → ∞.

From (30.3.7) we then obtain by condition b)

(30.3.10)    |x_i| < λ/c.

Further, when α′ = (α₁′, …, α_s′) and α″ = (α₁″, …, α_s″) are any points in the interval A, we obtain from (30.3.6) after some calculations, using the conditions b) and c), and expanding in Taylor's series,

(30.3.11)    |ω_j(α′) − ω_j(α″)| ≤ K₁ √n |α′ − α″| (|α′ − α₀| + |α″ − α₀| + λ/√n).

In the second member, we use the notation |a − b| for the distance (cf 3.1) between two points a and b in the s-dimensional space of the α_j, while K₁ is a constant independent of j and n.

We now define a sequence of vectors α₁, α₂, … by writing for ν = 1, 2, …

(30.3.12)    α_ν = α₀ + n^{−1/2}(B′B)⁻¹B′x + (B′B)⁻¹ω(α_{ν−1}),

and we propose to show that the sequence α₁, α₂, … converges to a definite limit α, which is then evidently a solution of (30.3.8). By (30.3.6) we have ω(α₀) = 0, and thus

(30.3.13)    α₁ − α₀ = n^{−1/2}(B′B)⁻¹B′x,

while for ν > 0

(30.3.14)    α_{ν+1} − α_ν = (B′B)⁻¹(ω(α_ν) − ω(α_{ν−1})).

Now the matrices (B′B)⁻¹B′ and (B′B)⁻¹ are both independent of n. Denoting by g an upper bound of the absolute values of the elements of these two matrices, it then in the first place follows from (30.3.13) and (30.3.10) that every element of the vector α₁ − α₀ is in absolute value smaller than g r λ/(c√n), so that

|α₁ − α₀| ≤ K₂ λ/√n,

where K₂ is independent of n.

¹) Note that we cannot write here (B′B)⁻¹ = B⁻¹(B′)⁻¹, since by hypothesis s < r, so that B is not square, and the reciprocal B⁻¹ is undefined. — If we take s = r, it will be seen that the conditions a) and d) of the theorem are incompatible. In this case, if we assume that a) — c) are satisfied, the matrices D, B and B′B are all singular, so that the reciprocal (B′B)⁻¹ is undefined, and (30.3.8) has no sense.
In a similar way, it then follows from (30.3.14) and (30.3.11) that we have

$|\alpha_{\nu+1} - \alpha_\nu| \le K_3 |\alpha_\nu - \alpha_{\nu-1}| \cdot \Big( |\alpha_\nu - \alpha_0| + |\alpha_{\nu-1} - \alpha_0| + \frac{\lambda}{\sqrt{n}} \Big)$

for every $\nu > 0$, where $K_3$ is independent of $\nu$ and $n$. From the two last inequalities, it now follows by induction that we have for all sufficiently large $n$, and for all $\nu = 0, 1, 2, \ldots$

(30.3.15)  $|\alpha_{\nu+1} - \alpha_\nu| \le \frac{K_2 \lambda}{\sqrt{n}} \Big[ (4 K_2 + 1) K_3 \frac{\lambda}{\sqrt{n}} \Big]^\nu.$

Since by hypothesis $\alpha_0$ is an inner point of the interval $A$, it follows that for all sufficiently large $n$ the vectors $\alpha_1, \alpha_2, \ldots$ (considered as points in the $s$-space) all belong to $A$, and that the sequence $\alpha_1, \alpha_2, \ldots$ converges to a definite limit

(30.3.16)  $\alpha = \alpha_0 + (\alpha_1 - \alpha_0) + (\alpha_2 - \alpha_1) + \cdots,$

which, as already observed, is a solution of (30.3.8), and thus also of the fundamental equations (30.3.3). It follows from (30.3.15) that $\alpha \to \alpha_0$ as $n \to \infty$. Moreover, $\alpha$ is the only solution of (30.3.8) tending to $\alpha_0$ as $n \to \infty$. In fact, if $\bar{\alpha}$ is another solution tending to $\alpha_0$, we have

$\bar{\alpha} - \alpha = (B'B)^{-1} (\omega(\bar{\alpha}) - \omega(\alpha)),$

and by the same argument as above it follows that

$|\bar{\alpha} - \alpha| \le K_3 |\bar{\alpha} - \alpha| \cdot \Big( |\bar{\alpha} - \alpha_0| + |\alpha - \alpha_0| + \frac{\lambda}{\sqrt{n}} \Big),$

where the expression within the brackets tends to zero as $n \to \infty$; but this is evidently only possible if $\bar{\alpha} = \alpha$ for all sufficiently large $n$. All this has been proved under the assumption that the relations (30.3.9) are satisfied, and thus holds with a probability which is greater than $1 - \lambda^{-2}$, and consequently tends to 1 as $n \to \infty$. We have thus established the existence of exactly one solution of (30.3.8), or (30.3.3), which converges in probability to $\alpha_0$, and the first part of the proof is completed.

Still assuming that the relations (30.3.9) are satisfied, we obtain from (30.3.8), (30.3.13) and (30.3.16)

$(B'B)^{-1} \omega(\alpha) = \alpha - \alpha_1 = (\alpha_2 - \alpha_1) + (\alpha_3 - \alpha_2) + \cdots.$

It then follows from (30.3.15) that every component of the vector $(B'B)^{-1} \omega(\alpha)$ is smaller than $K \lambda^2/n$, where $K$ is independent of $n$, so that (30.3.8) may be written

(30.3.17)  $\alpha - \alpha_0 = n^{-\frac{1}{2}} (B'B)^{-1} B' x + \theta \, \frac{K \lambda^2}{n},$

where $\theta = (\theta_1, \ldots, \theta_s)$ denotes a column vector such that $|\theta_j| \le 1$ for $j = 1, \ldots, s$.
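The construction (30.3.12)–(30.3.16) is an ordinary fixed-point iteration with a geometrically shrinking error. A one-dimensional sketch, with an invented contraction term $g$ standing in for $(B'B)^{-1}\omega$ and invented constants, shows the mechanism:

```python
# Scalar sketch of the iteration (30.3.12): x_nu = a0 + c + g(x_{nu-1}),
# where g vanishes at a0 and is a contraction near a0 (g and all numbers
# here are invented; c plays the role of the n^{-1/2}(B'B)^{-1}B'x term).
a0, c = 1.0, 0.05

def g(a):
    return 0.1 * (a - a0) ** 2   # small Lipschitz constant near a0

x = a0                           # start the iteration at alpha_0
for _ in range(60):
    x = a0 + c + g(x)

residual = abs(x - (a0 + c + g(x)))   # x now solves the fixed-point equation
```

The limit solves $x = a_0 + c + g(x)$, and the successive differences decrease geometrically, exactly as in (30.3.15).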
Consider now the variables $y_i$ defined by (30.3.4). Still assuming that the relations (30.3.9) are satisfied, we obtain by means of (30.3.7), (30.3.10) and (30.3.17), expressing the relation in matrix notation,

$y = x - \sqrt{n}\, B (\alpha - \alpha_0) + \theta' \, \frac{K' \lambda^2}{\sqrt{n}},$

where $y = (y_1, \ldots, y_r)$ and $\theta' = (\theta_1', \ldots, \theta_r')$ with $|\theta_i'| \le 1$, while $K'$ is independent of $n$. Substituting here the expression (30.3.17) for $\alpha - \alpha_0$, we obtain

(30.3.18)  $y = [I - B(B'B)^{-1}B'] \, x + \theta'' \, \frac{K'' \lambda^2}{\sqrt{n}},$

where $I$ is the unit matrix of order $r \cdot r$, and $\theta'' = (\theta_1'', \ldots, \theta_r'')$ with $|\theta_i''| \le 1$, while $K''$ is independent of $n$.

We now drop the assumption that the relations (30.3.9) are satisfied, and define a vector $z = (z_1, \ldots, z_r)$ by writing $y = Ax + z$, where $A$ denotes the symmetric matrix

$A = I - B(B'B)^{-1}B'.$

It then follows from (30.3.18) that, with a probability greater than $1 - \lambda^{-2}$, we have $|z_i| \le K'' \lambda^2/\sqrt{n}$ for all $i$, so that $z$ converges in probability to zero. Further, it has been shown in 30.1 that the variables $x_1, \ldots, x_r$ are, in the limit as $n \to \infty$, normally distributed with zero means and the moment matrix $\Lambda = I - p p'$, where $p = (\sqrt{p_1^0}, \ldots, \sqrt{p_r^0})$. By the last proposition of 22.6 it then follows that the limiting distribution of $y$ is obtained by the linear transformation $y = Ax$, where $x = (x_1, \ldots, x_r)$ has its normal limiting distribution, with the moment matrix $\Lambda$ of rank $r - 1$. By 24.4, the joint limiting distribution of $y_1, \ldots, y_r$ is thus normal, with zero means and the moment matrix

$A \Lambda A' = [I - B(B'B)^{-1}B'] \, [I - pp'] \, [I - B(B'B)^{-1}B'].$

Now by condition a) the $j$:th element of the vector $B'p$ is $\sum_i \sqrt{p_i^0} \, b_{ij} = \sum_i (\partial p_i/\partial \alpha_j)_0 = 0$, so that $B'p$ is identically zero. Hence we find on multiplication that the moment matrix of the limiting $y$-distribution reduces to

(30.3.19)  $A \Lambda A' = I - pp' - B(B'B)^{-1}B'.$

It now only remains to show that this symmetric matrix of order $r \cdot r$ has $r - s - 1$ characteristic numbers equal to 1, while the rest are 0, so that the effect of the last term is to reduce the rank of the matrix by $s$ units.
It then follows from 24.5 that the sum of squares $\chi^2 = \sum_i y_i^2$ is, in the limit, distributed in a $\chi^2$-distribution with $r - s - 1$ degrees of freedom, so that our theorem will be proved.

For this purpose we first observe that, by 11.9, the $s$ characteristic numbers $\varkappa_j$ of the symmetric matrix $B'B$ are all positive. Writing $\varkappa_j = \mu_j^2$, where $\mu_j > 0$, and denoting by $M$ the diagonal matrix formed by the diagonal elements $\mu_1, \ldots, \mu_s$, we may thus by 11.9 find an orthogonal matrix $C$ of order $s \cdot s$ such that $C'B'BC = M^2$, and hence

$(B'B)^{-1} = (C M^2 C')^{-1} = C M^{-2} C'.$

It follows that

(30.3.20)  $B(B'B)^{-1}B' = B C M^{-2} C' B' = HH',$

where $H = BCM^{-1}$ is a matrix of order $r \cdot s$ such that

$H'H = M^{-1} C' B' B C M^{-1} = M^{-1} M^2 M^{-1} = I,$

denoting here by $I$ the unit matrix of order $s \cdot s$. The last relation signifies that the $s$ columns of the matrix $H$ satisfy the orthogonality relations (11.9.2). Further, we have shown above that $B'p = 0$, and hence $H'p = M^{-1} C' B' p = 0$. Thus if we complete the matrix $H$ by an additional column with the elements $\sqrt{p_1^0}, \ldots, \sqrt{p_r^0}$, the $s + 1$ columns of the new matrix $H_1$ will still satisfy the orthogonality relations. Since $s < r$, we may then by 11.9 find an orthogonal matrix $K$ of order $r \cdot r$, the $s + 1$ last columns of which are identical with the matrix $H_1$. Then $K'p$ is a matrix of order $r \cdot 1$, i. e. a column vector, and it follows from the multiplication rule that we have $K'p = (0, \ldots, 0, 1)$. Thus the product $K' p p' K$ is a matrix of order $r \cdot r$, all elements of which are zero, except the last element of the main diagonal, which is equal to one. — In a similar way it is seen that the product $K'HH'K$ is a matrix of order $r \cdot r$, all elements of which are zero, except the $s$ diagonal elements immediately preceding the last, which are all equal to one.
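The rank statement can be checked numerically: for any probabilities $p_i^0$ summing to 1 and any derivative matrix whose columns sum to zero (condition a), the matrix (30.3.19) is a projection with $r - s - 1$ eigenvalues equal to 1 and $s + 1$ equal to 0. The $p_i^0$ and derivative values below are invented for the check.

```python
import numpy as np

# Numerical check of the rank of I - p p' - B (B'B)^{-1} B' from (30.3.19),
# here with r = 5, s = 2, so that r - s - 1 = 2 unit eigenvalues are expected.
q = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # invented p_i^0, sum = 1
p = np.sqrt(q)                                 # the vector p
D = np.array([[ 0.02,  0.01],
              [-0.01,  0.03],
              [ 0.04, -0.02],
              [-0.03, -0.05],
              [-0.02,  0.03]])                 # invented dp_i/da_j; columns sum to 0
B = D / p[:, None]                             # b_ij = (dp_i/da_j) / sqrt(p_i^0)
M = np.eye(5) - np.outer(p, p) - B @ np.linalg.inv(B.T @ B) @ B.T
eigenvalues = np.sort(np.linalg.eigvalsh(M))   # expect 0, 0, 0, 1, 1
```

Since $B'p = 0$ holds automatically here (the columns of $D$ sum to zero), $M$ projects onto the orthogonal complement of the $s + 1$ vectors $p, h_1, \ldots, h_s$, in agreement with the argument in the text.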
By (30.3.20), the moment matrix (30.3.19) now takes the form $I - pp' - HH'$. It follows from the above that the transformed matrix $K'[I - pp' - HH']K$ is a diagonal matrix, the $r - s - 1$ first diagonal elements of which are equal to 1, while the rest are 0. Thus we have proved our assertion about the characteristic numbers of the moment matrix (30.3.19). As observed above, this completes the proof of the theorem.

By means of this theorem, we can now introduce a test of the hypothesis $H$ in exactly the same way as in the simpler case considered in 30.1. Some examples of the application of this test will be shown in the following paragraph.

30.4. Examples. — We shall here apply the $\chi^2$ test to two particularly important cases, viz. the Poisson and the normal distribution. Other simple distributions may be treated in a similar way.

Ex. 1. The Poisson distribution. Suppose that it is required to test the hypothesis that a given sample of $n$ values $x_1, \ldots, x_n$ is drawn from some Poisson distribution, with an unknown value of the parameter $\lambda$. Every $x_\mu$ is equal to some non-negative integer $i$, and we arrange the $x_\mu$ according to their values into $r$ groups, pooling the data for the smallest and the largest values of $i$, where the observations are few. Suppose that we obtain in this way

$\nu_k$ observations with $x \le k$,
$\nu_i$ observations with $x = i$, where $i = k + 1, \ldots, k + r - 2$,
$\nu_{k+r-1}$ observations with $x \ge k + r - 1$.

If we write $\vartheta_i = P(x = i) = e^{-\lambda} \lambda^i / i!$, the corresponding probabilities are

$p_k = P(x \le k) = \sum_{i=0}^{k} \vartheta_i,$
$p_i = P(x = i) = \vartheta_i$  for $i = k + 1, \ldots, k + r - 2,$
$p_{k+r-1} = P(x \ge k + r - 1) = \sum_{i=k+r-1}^{\infty} \vartheta_i.$

In order to estimate the unknown parameter $\lambda$ by the modified $\chi^2$ minimum method, we have to solve the system (30.3.3), or the equivalent system (30.3.3 a). Since there is only one unknown parameter, we have $s = 1$, so that each system reduces to one single equation, and (30.3.3 a) gives

$\frac{\nu_k}{p_k} \frac{d p_k}{d \lambda} + \sum_{i=k+1}^{k+r-2} \nu_i \Big( \frac{i}{\lambda} - 1 \Big) + \frac{\nu_{k+r-1}}{p_{k+r-1}} \frac{d p_{k+r-1}}{d \lambda} = 0.$
This equation has a single root $\lambda = \lambda^*$. In the expression for $\lambda^*$ obtained by solving it, the central groups contribute the sum of all $x_\mu$ such that $k < x_\mu < k + r - 1$, while the first and the last terms give approximately the sums of those $x_\mu$ which are $\le k$ or $\ge k + r - 1$ respectively. The estimate $\lambda^*$ to be used for $\lambda$ is thus approximately equal to the arithmetic mean of the sample values:

$\lambda^* \approx \bar{x} = \frac{1}{n} \sum_{\mu=1}^{n} x_\mu.$

Taking $s = 1$ in the theorem of the preceding paragraph, we find that the limiting $\chi^2$-distribution has in this case $r - 2$ d. of fr.

In Table 30.4.1, three numerical examples of the application of the test are shown. Ex. 1 a) gives the numbers of $\alpha$-particles radiated from a disc in 2608 periods of 7.5 seconds, according to Rutherford and Geiger (Ref. 2, p. 77). Ex. 1 b) gives the numbers of red blood corpuscles in the 169 compartments of a haemacytometer observed by N. G. Holmberg. Ex. 1 c) gives the numbers of flowers of 200 plants of Primula veris counted by M.-L. Cramér at Utö in 1928.

Table 30.4.1. Application of the $\chi^2$ test to the Poisson distribution. [For each example the table gives, for every group $i$, the observed frequency $\nu_i$ and the expected frequency $n p_i$, together with the contributions $(\nu_i - n p_i)^2 / n p_i$; the individual entries are illegible in this transcription. The summary lines are:]

Ex. 1 a): $n = 2608$, $\lambda^* = 3.870$, $\chi^2 = 12.88$ (9 d. of fr.), $P = 0.17$.
Ex. 1 b): $n = 169$, $\lambda^* = 11.91$, $\chi^2 = 4.01$ (9 d. of fr.), $P = 0.91$.
Ex. 1 c): $n = 200$, $\lambda^* = 8.86$, $\chi^2 = 16.64$ (7 d. of fr.), $P = 0.08$.

According to the rule given in 30.1, the tail groups of each sample have been pooled so that every group contains at least 10 expected observations. Thus e. g. in 1 b) the observed frequencies in the pooled groups $i \le 7$ and $i \ge 17$ are respectively $1 + 3 + 5 + 8 = 17$ and $6 + 3 + 2 + 2 + 1 = 14$. — The agreement is good in a), and even very good in b), while in c) we find an »almost significant» deviation from the hypothesis of a Poisson distribution, which is mainly due to the excessive number of plants with eight flowers.

The cases considered above are representative of classes of variables which often agree well with the Poisson distribution. — When the data show a significant deviation from the Poisson distribution, the agreement may sometimes be considerably improved by introducing the hypothesis that the parameter $\lambda$ itself is a random variable, distributed in a Pearson type III distribution with the fr. f. $\frac{\alpha^\varkappa}{\Gamma(\varkappa)} x^{\varkappa-1} e^{-\alpha x}$ ($x > 0$), where $\alpha$ and $\varkappa$ are positive parameters. In this way we obtain the negative binomial distribution (cf Ex. 21, p. 259), which has interesting applications e. g. to accident and sickness statistics (Greenwood and Yule, Ref. 119, Eggenberger, Ref. 81, Newbold, Ref. 169 a), and to problems connected with the number of individuals belonging to given species in samples from plant or animal populations (Eneroth, Ref. 81 a; Fisher, Corbet and Williams, Ref. 111). In the case of accident data, the introduction of a variable $\lambda$ may be interpreted as a way of taking account of the variation of risk among the members of a given population. Analogous interpretations may be advanced in other cases. The subject may also be considered from the point of view of random processes (cf Lundberg, Ref. 162).

Ex. 2. The normal distribution.
Let a sample of $n$ values $x_1, \ldots, x_n$ be grouped into $r$ classes, the $i$:th class containing $\nu_i$ observations situated in the interval $(\xi_i - \frac{1}{2}h, \, \xi_i + \frac{1}{2}h)$, where $\xi_i = \xi_1 + (i - 1)h$. We want to test the hypothesis that the sample has been drawn from some normal population, with unknown values of the parameters $m$ and $\sigma$. If the hypothesis is true, the probability $p_i$ corresponding to the $i$:th class is

$p_i = \frac{1}{\sigma \sqrt{2\pi}} \int e^{-\frac{(x-m)^2}{2\sigma^2}} \, dx,$

where the integral is extended over the $i$:th class interval. For the two extreme classes ($i = 1$ and $i = r$), the intervals should be $(-\infty, \, \xi_1 + \frac{1}{2}h)$ and $(\xi_r - \frac{1}{2}h, \, +\infty)$ respectively. We then have, writing for brevity $g(x) = e^{-\frac{(x-m)^2}{2\sigma^2}}$,

$\frac{\partial p_i}{\partial m} = \frac{1}{\sigma^3 \sqrt{2\pi}} \int (x - m) \, g(x) \, dx, \qquad \frac{\partial p_i}{\partial \sigma} = \frac{1}{\sigma^2 \sqrt{2\pi}} \int \Big( \frac{(x-m)^2}{\sigma^2} - 1 \Big) g(x) \, dx.$

The equations (30.3.3 a) then give after some simple reductions, all integrals being extended over the respective class intervals specified above,

$m = \frac{1}{n} \sum_i \nu_i \, \frac{\int x \, g(x) \, dx}{\int g(x) \, dx}, \qquad \sigma^2 = \frac{1}{n} \sum_i \nu_i \, \frac{\int (x - m)^2 \, g(x) \, dx}{\int g(x) \, dx}.$

We first assume that the grouping has been arranged such that the two extreme classes do not contain any observed values. We then have $\nu_1 = \nu_r = 0$. For small values of $h$, an approximate solution may be obtained simply by replacing the functions under the integrals by their values in the mid-point of the corresponding class interval. In this way we obtain estimates $m^*$ and $\sigma^{*2}$ given by the expressions

$m^* = \frac{1}{n} \sum_i \nu_i \xi_i = \bar{x}, \qquad \sigma^{*2} = \frac{1}{n} \sum_i \nu_i (\xi_i - \bar{x})^2 = s^2.$

Thus $m^*$ and $\sigma^{*2}$ are identical with the mean $\bar{x}$ and the variance $s^2$ of the grouped sample, calculated according to the usual rule (cf 27.9) that all sample values in a certain class are placed in the mid-point of the class interval. — In order to obtain a closer approximation, we may develop the functions under the integrals in Taylor's series about the mid-point $\xi_i$. For small $h$, we then find by some calculation that the above formulae should be amended as follows:

$m^* = \bar{x}, \qquad \sigma^{*2} = s^2 - \frac{h^2}{12}.$

Neglecting terms of order $h^4$, we may thus use the mean of the grouped sample as our estimate of $m$, while Sheppard's correction (cf 27.9) should be applied to the variance.
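The whole procedure, mid-point estimates, Sheppard's correction, and the class probabilities of the fitted normal law with open extreme classes, can be sketched in a few lines. The grouped sample below is invented for illustration.

```python
import math

# Sketch of the grouped-normal fit: mid-point estimates m*, s^2, Sheppard's
# correction, and class probabilities p_i of the fitted law (invented data).
h = 0.5
mids  = [1.0, 1.5, 2.0, 2.5, 3.0]        # class mid-points xi_i
freqs = [2, 10, 20, 10, 2]               # class frequencies nu_i
n = sum(freqs)

m_star = sum(x * f for x, f in zip(mids, freqs)) / n
s2 = sum(f * (x - m_star) ** 2 for x, f in zip(mids, freqs)) / n
s2_sheppard = s2 - h * h / 12            # Sheppard's correction

def Phi(x):                              # normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

sigma = math.sqrt(s2_sheppard)
cuts = [mids[0] + h / 2 + i * h for i in range(len(mids) - 1)]   # class limits
cum = [Phi((b - m_star) / sigma) for b in cuts]
# extreme classes are open: (-inf, first cut) and (last cut, +inf)
probs = [cum[0]] + [b - a for a, b in zip(cum, cum[1:])] + [1.0 - cum[-1]]
```

The probabilities $p_i$ sum to one by construction, and $n p_i$ gives the expected frequencies entering the $\chi^2$ sum, with $r - 3$ d. of fr. after the two parameters have been estimated.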
Even when $h$ is not very small, and when the extreme classes are not actually empty, but contain only a small part of the total sample, the same procedure will lead to a reasonable approximation. — In practice, it is advisable to pool the extreme classes of a given sample according to the rule given in 30.1, so that every class contains at least 10 expected observations. Our estimates of $m$ and $\sigma^2$ should then if possible be the values of $\bar{x}$ and $s^2$ calculated from the original grouping, before any pooling has taken place, and with Sheppard's correction applied to $s^2$. If $r$ is the number of classes after the pooling, and actually used for the calculation of $\chi^2$, the limiting distribution of $\chi^2$ has $r - 3$ d. of fr., since we have determined two parameters from the sample.

When the parent distribution is normal, asymptotic expressions for the means and variances of the sample characteristics $g_1$ and $g_2$ have been given in (27.7.9), while the corresponding exact expressions are found in (29.3.7). A further test of the normality of the distribution is obtained by comparing the values of $g_1$ and $g_2$ calculated from an actual sample with the corresponding means and variances.

Table 30.4.2. Distribution of mean temperatures for June and July in Stockholm 1841–1940.

June:
Degrees Celsius | Observed | Expected
      —12.4     |    10    |  12.89
  12.5—12.9     |    12    |   7.89
  13.0—13.4     |     9    |  10.20
  13.5—13.9     |    10    |  11.98
  14.0—14.4     |    19    |  12.62
  14.5—14.9     |    10    |  12.08
  15.0—15.4     |     9    |  10.46
  15.5—15.9     |     6    |   8.19
  16.0—16.4     |     7    |   6.81
  16.5—         |     8    |   7.98
  Total         |   100    | 100.00

$\bar{x} = 14.28$, $s = 1.674$, $g_1 = 0.098$, $g_2 = 0.062$, $\chi^2 = 7.86$ (7 d. of fr.), $P = 0.35$.

July:
Degrees Celsius | Observed | Expected
      —14.9     |    11    |  10.41
  15.0—15.4     |     7    |   6.72
  15.5—15.9     |     8    |   9.00
  16.0—16.4     |    13    |  10.95
  16.5—16.9     |    14    |  12.12
  17.0—17.4     |    13    |  12.20
  17.5—17.9     |     6    |  11.16
  18.0—18.4     |     9    |   9.28
  18.5—18.9     |     7    |   7.02
  19.0—         |    12    |  11.14
  Total         |   100    | 100.00

$\bar{x} = 16.98$, $s = 1.616$, $g_1 = 0.382$, $g_2 = -0.044$, $\chi^2 = 3.14$ (7 d. of fr.), $P = 0.87$.

Table 30.4.3. Breadth of beans. $\xi_1 = 6.825$ mm, $h = 0.25$ mm.
Class number | Observed frequency | Expected frequency $n p_i$: Normal | First approx. | Second approx.
  1  |    32  |    57.6  |    17.6  |    26.6
  2  |   103  |   132.2  |    98.8  |    90.4
  3  |   239  |   309.8  |   201.5  |   277.2
  4  |   624  |   617.8  |   648.9  |   636.8
  5  |  1187  |  1046.7  |  1142.2  |  1141.1
  6  |  1650  |  1606.8  |  1630.4  |  1639.0
  7  |  1883  |  1842.8  |  1918.1  |  1931.6
  8  |  1930  |  1910.9  |  1892.4  |  1906.2
  9  |  1638  |  1697.6  |  1687.8  |  1699.5
 10  |  1130  |  1277.8  |  1168.8  |  1163.6
 11  |   737  |   817.0  |   762.4  |   746.1
 12  |   427  |   444.2  |   441.0  |   427.8
 13  |   221  |   206.8  |   236.6  |   223.8
 14  |   110  |    80.7  |   112.7  |   109.1
 15  |    57  |    27.0  |    47.6  |    49.7
 16  |    32  |    10.0  |    24.6  |    32.2
Total | 12 000 | 12 000.0 | 12 000.0 | 12 000.0

$\bar{x} = 8.512$, $s = 0.6168$, $g_1 = -0.2878$, $g_2 = 0.1968$.
$\chi^2 = 196.6$ (13 d. of fr.), $P < 0.001$; $\chi^2 = 84.3$ (12 d. of fr.), $P < 0.001$; $\chi^2 = 14.2$ (11 d. of fr.), $P = 0.19$.

Table 30.4.2 shows the result of fitting normal curves to the distributions of mean temperatures for the months of June and July in Stockholm during the $n = 100$ years 1841–1940. In the original data, the figures are given to the nearest tenth of a degree, so that the exact class intervals are (12.45, 12.95), etc. We have here used somewhat smaller groups than is usually advisable. Both values of $\chi^2$ indicate a satisfactory agreement with the hypothesis of a normal distribution. The values of $g_1$ and $g_2$ are also given in the table. On the normal hypothesis, the exact expressions (29.3.7) give in both cases $E(g_1) = 0$, $D(g_1) = 0.238$, and $E(g_2) = -0.059$, $D(g_2) = 0.466$, so that none of the observed values differs significantly from its mean.
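The supplementary normality check via $g_1$ and $g_2$ can be sketched as follows, using the large-sample standard errors $\sqrt{6/n}$ and $\sqrt{24/n}$ in place of the exact expressions of (29.3.7); the grouped sample is invented for illustration, and Sheppard's corrections are omitted for brevity.

```python
import math

# Sample coefficients of skewness g1 and excess g2 from a grouped sample,
# compared with their large-sample standard errors (invented data; the
# asymptotic s.e. values sqrt(6/n), sqrt(24/n) are the standard ones).
mids  = [-2, -1, 0, 1, 2]
freqs = [1, 4, 6, 4, 1]
n = sum(freqs)

mean = sum(x * f for x, f in zip(mids, freqs)) / n
m2 = sum(f * (x - mean) ** 2 for x, f in zip(mids, freqs)) / n
m3 = sum(f * (x - mean) ** 3 for x, f in zip(mids, freqs)) / n
m4 = sum(f * (x - mean) ** 4 for x, f in zip(mids, freqs)) / n

g1 = m3 / m2 ** 1.5          # skewness
g2 = m4 / m2 ** 2 - 3.0      # excess
se_g1 = math.sqrt(6.0 / n)
se_g2 = math.sqrt(24.0 / n)
significant = abs(g1) > 2 * se_g1 or abs(g2) > 2 * se_g2
```

With a symmetric sample, $g_1$ vanishes exactly; the two-standard-error criterion used here corresponds to the informal significance judgement made in the text.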
Table 30.4.3 shows the distribution of the breadths of w = 12 000 beans of Phaseolus vulgaris (Johannsen’s data, quoted from Charlier, ftef. 9, p. 73). On the hypothesis of a normal distribution, we have E(^i) = 0, D(sr,) = 0.0224, and ^(^g) =— 0.0006, D (^j) = 0.0447, so that the actual values of g^ and given in the table both diifer signifi- cantly from the values expected on the normal hypothesis. The table gives also the expected frequencies and the corresponding values of calculated on the three hypotheses that the fr. f. of the standardized variable (17.7.3) or (17.7.5),') X — T is, in accordance with the expansion 1 a) »normal» w(x) = e 2 , b) » first approx. » .... g>[x) — c) »second approx. » . . . gp (^c) — In the first two cases, the deviations of the sample from the hypo- thetical distributions are highly significant, the values of P being <0.001, while in the third case we have P = ().]9, so that the agree- ment is satisfactory. — In Fig. 26, p. 329, we have shown the histo- gram of this distribution, compared with the frequency curve for the »8econd approx. >. More detailed comparisons for this and other examples are given by Cram<5r, Ref. 70. 30.5. Contingency tables. — Suppose that the n individuals of a sample are classified according to two variable arguments (quantitative or not) in a two-way table of the type shown in Table 30. .5.1. A table of this kind is known as a contingency tahlc\ and it is *) By tho .same method as above, it is shown that the estimates to be used for the coefficients y, and are (7, and g^,, as calculated from the grouped sample, using Sheppard’s corrections. 441 30.5 often required to test the hypothesis that the two variable arguments are independent. Denote by ptj the probability that a randomly chosen individual belongs to the iith. row and the ^rth column of the table. Table 30.5.1. Arguments 1 2 8 Total 1 Vxi ^1* 2 ^1$ ^ 2 . r ^ 1 . ^r. 
The hypothesis of independence is then (cf 21.1.4) equivalent to the hypothesis that there exist $r + s$ constants $p_{i.}$ and $p_{.j}$ such that

$p_{ij} = p_{i.} \, p_{.j}, \qquad \Big( \sum_i p_{i.} = \sum_j p_{.j} = 1 \Big).$

According to this hypothesis, the joint distribution of the two arguments contains $r + s - 2$ unknown parameters, since by means of the last relations two of the $r + s$ constants, say $p_{r.}$ and $p_{.s}$, may be expressed in terms of the remaining $r + s - 2$. In order to apply the $\chi^2$ test to this problem, we have to calculate

$\chi^2 = \sum_{i,j} \frac{(\nu_{ij} - n p_{i.} p_{.j})^2}{n p_{i.} p_{.j}},$

where the sum is extended over all $rs$ classes of the contingency table, and replace here the parameters $p_{i.}$ and $p_{.j}$ by their estimates derived from the equations (30.3.3) or (30.3.3 a), which in this case become

$\frac{\nu_{i.}}{p_{i.}} - \frac{\nu_{r.}}{p_{r.}} = 0, \quad (i = 1, \ldots, r - 1), \qquad \frac{\nu_{.j}}{p_{.j}} - \frac{\nu_{.s}}{p_{.s}} = 0, \quad (j = 1, \ldots, s - 1).$

The solution of these equations is
The upper limit 1 is attained when and only when each row (when r ^ s) or each column (when r S s) contains one single element different y* from zero. Thus — rr may be regarded as a measure of the degree n(q-l) , of association indicated by the sample. The distribution of this measure is, of course, obtained by a simple change of variable in the distribu- tion of x*- (For other measures of association, cf e. g. the text* book by Yule-Kendall, Ref. 43, chs 3 — 4.) 443 30.5 At the Swedish census of March 1936, a saiTi])le of 25 263 married couples was taken from the population of all married couples in country districts, who had been married for at most five years. Table 30.5.2 g^ives the distribution of the sample according^ to annual income and number of children. From (30.5.1) we obtain = 568.5 with (5 — 1)(4 — 1) = 12 d. of fr., so that the deviation from the hy])othesis of independence is higfhly significant. On the other hand, the measure y ^ of association is ^ ~ 0. 00760, thus indicating only a slight degree of dependence. Table 30.5.2. Distribution of married couples according to annual income and number of children. Children 0-1 Income (unit 1000 kr) i 1-2 ; 2-r3 H- i 1 Total 1 0 2 161 : 8 677 2 184 1 6H6 9 568 1 2 766 6 081 2 222 1 062 11 no 986 1 763 640 i 306 225 1 419 96 38 89 i 98 HI 14 Total 6 116 10 928 1 6 173 I 3 046 26 26H In the particular case when r — 5 = 2, the contingency table 30.5.1 becomes a 2 -2 table or a fourfold table, and the expression (30.5.1) reduces to ( 30 . 5 !» . = Vi.V2,V,iV,2 so that — corresponds to the expression (21.9.7) for g)“. When the arguments are quantitative, f^ is identical with the square of the correlation coefficient of the sample (cf 21.9.7 and 21.7.3). — In the case of a fourfold table, there is only (2 — 1)(2 — 1) = 1 d. of fr. in the limiting distribution of yf and we have q — 1 = 1 . 
In Table 30.5.3, we give the distribution of head hair and eyebrow colours of 46 542 Swedish conscripts according to Lundborg and Lin- ders (Ref. 26). From (30.5.3) we obtain ;c“~ 19 288 and = 0.414, indicating a marked dependence between the arguments. 444 30.5-6 Table 30.5.3. Hair colurs of Swedish conscripts. Eyebrows Head hair Light or Dark or red 1 wediiim Total liight or red Park or niodiiim . . . 1 30 472 1 3 238 3 364 1 0 468 j 33 710 1 12 832 Total S3 836 12 706 46 542 When the expected frequencies in a fourfold table are small the approximation obtained by the usual tables will be improved if we calculate x^ from the first expression (30.5.1), and reduce the Vi V i absolute value of each difference Vfj by i before squaring^. This is known as Yates correciiov (Ref. 250). 30.6. x^ as a test of homogeneity. — The contingency table 30.5.1 expresses the joint result of a sequence of n repetitions of a random experiment, each individual result being classified according to two variable arguments. In many cases, however, we encounter tables of the same formal appearance, where the situation is different. Suppose that we have made s successive sequences of observations, consisting of observations respectively, wliere the numbers are not determined by chance, but are simply to be regarded as given numbers. At each observation we observe a certain variable argument, and the results of each sequence are classified according to this argument in r groups, the number of observations in the z:th group of the ./ith sequence being denoted by Vi j. Our data will then be expressed by a table which is formally identical with Table 30..5.1, the column totals vj being here denoted by nj. In the present case, however, the table does not express the result of one single sequence of observations, but of independent sequences, each of which cor- responds to one column of the table. 
In such a case, it is often required to test the hypothesis that the $s$ samples represented by the columns are drawn from the same population, so that the data are homogeneous in this respect. This is equivalent to the hypothesis that there are $r$ constants $p_1, \ldots, p_r$ with $\sum_i p_i = 1$, such that the probability of a result belonging to the $i$:th group is equal to $p_i$ in all $s$ sequences. In order to test this hypothesis, we calculate $\chi^2$ from the same formula (30.5.1) as in the previous case. A slight modification of the proof of 30.3 then shows that, if the hypothesis is true, $\chi^2$ has the usual limiting distribution with the same number $(r - 1)(s - 1)$ of d. of fr. as before. Unlike the corresponding proposition of the preceding paragraph, this is not a direct corollary of the general theorem of 30.3, but requires separate proof.

The theorem of 30.3 may, in fact, be generalized to the case when we consider $s$ independent samples of $n_1, \ldots, n_s$ individuals, all with the same $r$ frequency groups, and determine a certain number, say $t$, of unknown parameters by applying the modified $\chi^2$ minimum method to the expression

$\chi^2 = \sum_{i,j} \frac{(\nu_{ij} - n_j p_i)^2}{n_j p_i}.$

A straightforward generalization of the proof of 30.3 then shows that $\chi^2$ has the usual limiting distribution with $(r - 1)s - t$ d. of fr. In the case considered above, we are concerned with the hypothesis that the $s$ samples are drawn from the same population, without further specification of the distribution, so that the parameters are the probabilities $p_i$ themselves. Owing to the relation $\sum p_i = 1$ there are $t = r - 1$ parameters, and thus $(r - 1)s - (r - 1) = (r - 1)(s - 1)$ d. of fr.

By means of the generalized theorem, we may also test the hypothesis that $s$ given samples are drawn from the same population of a specified type, such as the Poisson, the normal, etc. In such a case, the application of the modified $\chi^2$ minimum method to the above expression for $\chi^2$ shows that the parameters of the distribution should be determined in the same way as if we were concerned with one single sample with group frequencies equal to the row sums $\nu_{i.}$ of the given table. The proof of this statement will be left as an exercise for the reader.

In the particular case when $r = 2$, the table may be written:

                 |     1         |     2         |  …  |     s         | Total
Occurrences      | $\nu_1$       | $\nu_2$       |  …  | $\nu_s$       | $\nu$
Non-occurrences  | $n_1 - \nu_1$ | $n_2 - \nu_2$ |  …  | $n_s - \nu_s$ | $n - \nu$
Total            | $n_1$         | $n_2$         |  …  | $n_s$         | $n$
of the given table. The proof of this statement will be left as an exercise for the reader. In the particular case when r = 2, the table may be written: (v, — W • X^ minimum method to the expression Z* ^ X ' * straightforwa Vi . . Vg j n^ — v, . n, — v. « — 2 * i j Wg . . ng 1 n 1 We are here concerned with s sequences of observations, the number of occurrences of a certain event E being respectively . . ., Vg, and we ask whether it is reasonable to assume that E has a constant, though unknown probability p throughout the observations. The 446 30.6 estimate to be used for p will here be the frequency ratio of E in the totality of the data: jP* = ^ “ 2 obtain from the formula (30.5.1)^) (30.6.1) P with 5—1 d. of fr. Writing quantity Q is n — 1 rr(7- i)' identical with the divergence coefficient introduced by Lexis. In ac- cordance with (30.5.2), we have E(Q^)=l. (Cf e. g. Tschuprow, Ref. 227 a, Cramer, Ref. 10, p. 105 — 123.) Table 30.6.1 gives the number of children born in Sweden during the 5=12 months of the year 1 935. The estimated probability of a male birth is p* = = 0.517 6082. From (30.6.1) we find x* = 14.986 with 11 d. of fr., which corresponds to P==0.]8, so that the data are consistent with the hypothesis of a constant probability. Table 30.6.1. Sex distribution of children born in Sweden in 1935. 1 Boys . . . Girls . . . 1 2 3 4 6 M O 1 6 1 t h 7 8 9 10 11 12 Total 3743 3637 3560 3407 4017 3866 4178 3711 4117 3776 j 3944 1 1 3666 3964 3621 3797 8696 3712 3491 3512 3391 3392 3160 3761 3371 46682 42691 Total 7280 6967 7883 7884 7892 7609 7686 7393 7203 6903 6662 7132 88273 We finally consider the case 5 = 2. In this case we are concerned with two independent samples, and we want to know whether these are drawn from the same population. The table may then be written Pi f*i + »’i p2 ^2 Pr Vr fir + Vr m n m n *) Cf also (30.2.1). 447 30.6 We have r — 1 d. of fr., and (30.5.1) gives (cf. K. Pearson, Ref. 180, and R. A. Fisher, Ref. 
91) (30.6.2) jr- = W) M 2 -f n ) Writing tzTi — — and ^ ^ this reduces to the following + Vi m + n expression due to Snedecor (Ref. 35, p. 173), which is often convenient for practical computation, (30.6.3) JJSJ ^ (w + «)“ / V \ m n fXt -f V, m + 71 J Table 30.6.2 gives some income distributions from the Swedish census of 1930. When we compare the income distributions of the age groups 40 — 50 and 50 — 60 for all industrial workers and employees, (30.6.3) gives = 840.62 with 5 d. of fr., showing a highly significant difference between the distributions. It is evident that in this case Table 30.6.2. Income distributions from Swedish census of 1930. All workers and employees in industry Foremen in industry Income (unit 1000 kr) Ai?e i^roup C7' t Ajjje group 0 1 o 1 60 — 60 i 40 — 60 60 — 60 0 — 1 7 831 1 j 7 668 0.6088 6997 71 64 , 0.6f580 0000 1 1“2 26 740 20 686 0..')63K 3764 430 324 0 .^)702 9178 1 2 — 3 30 672 24 186 0.6962 6758 1 072 894 0.. "*462 6958 ' 1 3 — 4 20 000 i 12 286 0.6196 8472 1 609 1 202 0.5728 9417 4 — 6 11 627 6 776 0.6297 8747 1 178 903 0.6660 7400 6 — 6 919 i 4 222 0.6210 3940 168 112 1 0., 58.^1 8.JS19 Total 108 698 76 707 1 0.5892 2981 4 618 3 489 j 0.5642 6628 z’ = 840.82 (6 d. P < 0 001 of fr.) 1 = i 4.27 (6 d. of fr.) 1 0.61 448 30.6-7 the numbers tzTj show a tendency to increase with increasing^ income. When we pass to the more homogfeneous group of the foremen, however, this tendency disappears, and the comparison of the income distribu- tions of the two age groups gives here — 4.27 and P” 0.61, so that we may consider these two samples as drawn from the same population. 30.7. Criterion of differential death rates. — Suppose that, in a mortality investigation, we have obtained the following data for two different classes (districts, occupations etc.) 
of persons:

Age group | Class A: Exposed to risk, Deaths | Class B: Exposed to risk, Deaths
    1     | $n_1$, $d_1$   | $n_1'$, $d_1'$
    2     | $n_2$, $d_2$   | $n_2'$, $d_2'$
    …     |       …        |       …
    r     | $n_r$, $d_r$   | $n_r'$, $d_r'$

It is required to test whether the sequences of death rates $d_i/n_i$ and $d_i'/n_i'$ obtained from these data are significantly different. For each age group, we may form a $2 \cdot 2$ table of the type

          |  Dead  | Surviving
Class A   | $d_i$  | $n_i - d_i$
Class B   | $d_i'$ | $n_i' - d_i'$

and calculate from (30.6.2) the corresponding quantity

$\chi_i^2 = \frac{(n_i + n_i') \, (d_i n_i' - d_i' n_i)^2}{n_i \, n_i' \, (d_i + d_i') \, (n_i + n_i' - d_i - d_i')},$

which has one d. of fr. The successive $\chi_i^2$ are independent. Thus if we assume that the two populations have identical death rates, the sum $\chi^2 = \sum_i \chi_i^2$ has the usual limiting distribution with $r$ d. of fr., and this provides a test of the hypothesis (cf K. Pearson and Tocher, Ref. 187; R. A. Fisher, Ref. 91; Wahlund, Ref. 228).

Table 30.7.1 contains some data from a tuberculosis investigation by G. Berg (Ref. 61). It is required to test whether there are any significant differences in mortality between the two sexes during the first year after the finding of T. B. plus. The total $\chi^2$ amounts to 22.2 with 10 d. of fr., which corresponds to $P = 0.014$, so that the deviation is »almost significant» according to our conventional terminology (cf 30.2). From the values of $\chi_i^2$ given in the last column of the table, it is seen that the main contributions to $\chi^2$ arise from the ages 30–50, where the women show a considerably higher mortality than the men.

Table 30.7.1. Death rates for patients suffering from open pulmonary tuberculosis, during first year after finding T.B. plus.

Age group | Men: Exposed, Deaths, Rate % | Women: Exposed, Deaths, Rate % | $\chi_i^2$
 16–19    |  406, 156, 38.4  |  500, 174, 34.8  |  1.26
 20–24    |  695, 204, 29.4  |  816, 246, 30.1  |  0.11
 25–29    |  585, 169, 28.9  |  619, 184, 29.7  |  0.09
 30–34    |  454, 128, 28.2  |  433, 150, 34.6  |  4.22
 35–39    |  274,  82, 29.9  |  257,  92, 35.8  |  2.10
 40–44    |  221,  68, 30.8  |  194,  83, 42.8  |  6.48
 45–49    |  153,  41, 26.8  |   94,  39, 41.5  |  5.75
 50–54    |  110,  34, 30.9  |   58,  20, 34.5  |  0.28
 55–59    |   69,  36, 52.2  |   29,  13, 44.8  |  0.45
 60–      |   88,  43, 48.8  |   47,  28, 59.6  |  1.57
 Total    | 3055, 961        | 3047, 1029       | 22.20
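The criterion is easily mechanized: one fourfold $\chi^2$ per age group, summed over the $r$ groups. The counts below are those of Table 30.7.1 as restored from a damaged scan, so the individual digits should be taken as approximate; the per-group formula is the fourfold-table expression.

```python
# Criterion of differential death rates (30.7) applied to Table 30.7.1
# (counts partly restored from the scan): one fourfold chi^2 per age group.
men = [(406, 156), (695, 204), (585, 169), (454, 128), (274, 82),
       (221, 68), (153, 41), (110, 34), (69, 36), (88, 43)]     # (exposed, deaths)
women = [(500, 174), (816, 246), (619, 184), (433, 150), (257, 92),
         (194, 83), (94, 39), (58, 20), (29, 13), (47, 28)]

def fourfold_chi2(a, b, c, d):
    """chi^2 of the 2 x 2 table [[a, b], [c, d]], one d. of fr."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chis = [fourfold_chi2(dm, nm - dm, dw, nw - dw)
        for (nm, dm), (nw, dw) in zip(men, women)]
total = sum(chis)          # compare with chi^2 = 22.2 on r = 10 d. of fr.
```

Since the ten fourfold statistics are independent, their sum is referred to the $\chi^2$ table with 10 d. of fr., as in the text.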
15–19       406       156        38.4           500       174        34.8          1.26
20–24       695       204        29.4           816       246        30.1          0.11
25–29       585       169        28.9           619       184        29.7          0.09
30–34       454       128        28.2           433       150        34.6          4.22
35–39       274        82        29.9           257        92        35.8          2.10
40–44       221        68        30.8           194        83        42.8          6.48
45–49       153        41        26.8            94        39        41.5          5.75
50–54       110        34        30.9            58        20        34.5          0.28
55–59        69        36        52.2            29        13        44.8          0.45
60–          89        43        48.3            47        28        59.6          1.57
Total     3 056       961                     3 047     1 029                     22.20

30.8. Further tests of goodness of fit. — As already observed in 30.1, it is always advisable to try to supplement the χ² test by other methods. In many cases, a simple inspection of the signs and magnitudes of the differences between observed and expected frequencies will reveal systematic deviations from the hypothesis tested, even though χ² may have a non-significant value.

When the χ² test is applied to a comparatively small sample, it is necessary to use a grouping with large class intervals, and thus sacrifice a good deal of the information conveyed by the sample. In such cases, it would be desirable to have recourse to a test based on the individual sample values. We shall now briefly mention a test of this type.

Let it be required to test the hypothesis that a sample of n observed values x₁, …, x_n has been drawn from a population with the given d. f. F(x). The d. f. of the sample (cf 25.3) is F*(x) = ν/n, where ν is the number of sample values ≤ x. Since F* converges in probability to F (cf 25.5) for any fixed x, we may consider the integral

∫ from −∞ to ∞ of [F*(x) − F(x)]² dK(x),

where K(x) may be more or less arbitrarily chosen, as a measure of the deviation of our sample from the hypothesis. Tests based on measures of this type were first introduced by Cramér (Ref. 10 and 70) and von Mises (Ref. 27). Following Smirnoff (Ref. 215), we shall here take K(x) = F(x), and thus obtain the integral

ω² = ∫ from −∞ to ∞ of [F*(x) − F(x)]² dF(x).
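For a fully specified continuous F, the statistic ω² can be evaluated directly from the ordered sample by the closed form ω² = 1/(12n²) + (1/n) Σ_ν [F(x_ν) − (2ν − 1)/(2n)]², which is the formula derived in the next paragraph. The following Python sketch illustrates this (the function name is ours, not Cramér's):

```python
def omega2(sample, F):
    # Cramer-von Mises omega^2 for a fully specified continuous d.f. F:
    # omega^2 = 1/(12 n^2) + (1/n) * sum over nu of
    #           [ F(x_(nu)) - (2 nu - 1)/(2 n) ]^2,  x_(1) <= ... <= x_(n).
    x = sorted(sample)
    n = len(x)
    s = sum((F(x[k]) - (2 * k + 1) / (2 * n)) ** 2 for k in range(n))
    return 1 / (12 * n ** 2) + s / n

# Hypothetical example: test the hypothesis of a uniform d.f. on (0, 1),
# i.e. F(x) = x.  Under the hypothesis E(omega^2) = 1/(6n).
w2 = omega2([0.1, 0.5, 0.9], lambda t: t)
```

A perfectly regularly spaced sample, with F(x_ν) = (2ν − 1)/(2n), attains the minimum value ω² = 1/(12n²).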
If the sample values x₁ ≤ x₂ ≤ ⋯ ≤ x_n are arranged in increasing order, we have for any continuous F(x)

ω² = 1/(12n²) + (1/n) Σ from ν = 1 to n of [ F(x_ν) − (2ν − 1)/(2n) ]².

When the individual sample values are known, the exact value of ω² may thus be simply calculated. When only a grouped sample is available, an approximate value can be found, e. g. by the usual assumption that the x_ν are situated in the mid-points of the class intervals.

As observed in 25.5, F*(x) is the frequency ratio in n trials of an event of probability F(x). Hence E[F* − F]² = F(1 − F)/n. By means of this remark, it is possible to find the mean and the variance of ω². These are independent of F(x), and we have

E(ω²) = 1/(6n),    D²(ω²) = (4n − 3)/(180n³).

Comparing the value of ω² found in an actual sample with the mean and the variance calculated from these expressions, we obtain a test of our hypothesis. — The sampling distribution of ω², which is independent of F(x), has been further investigated by Smirnoff (Ref. 215), who has shown that nω² has, as n → ∞, a certain non-normal limiting distribution (cf the analogous limiting distribution in 29.13). It would be desirable to extend the theory to cases when the hypothetical F(x) is not completely specified, but contains certain parameters that must be estimated from the sample. Further important tests of goodness of fit have been proposed e. g. by Neyman (Ref. 164) and E. S. Pearson (Ref. 191).

CHAPTER 31.

Tests of Significance for Parameters.

31.1. Tests based on standard errors. — In the applications, it is often required to use a set of sample values for testing the hypothesis that a certain parameter of the corresponding population, such as a mean, a correlation coefficient, etc., has some value given in advance. In other cases, several independent samples are available, and we want to test whether the differences between the observed values of a certain sample characteristic are significant, i. e. indicative of a real difference between the corresponding population parameters. Now we have seen in Ch.
28 that important classes of sample characteristics are, in large samples, asymptotically normal with means and variances determined by certain population parameters. Hence we may deduce tests of significance for hypotheses of the above type, following the general procedure indicated in 26.2 (cf also 35.1).

Thus if we draw a sample of n values x₁, …, x_n from any population (not necessarily normal) with the mean m and the s. d. σ, we know by 17.4 and 28.2 that the mean x̄ of the sample values is asymptotically normal (m, σ/√n). Suppose for one moment that we know σ, and that we are testing the hypothesis that m has a specified value m₀. If the hypothesis is true, x̄ is asymptotically normal (m₀, σ/√n). Denoting by λ_p the p % value of a normal deviate (cf 17.2), we thus have for large n a probability of approximately p % to encounter a deviation |x̄ − m₀| exceeding λ_p σ/√n. Working on a p % level, we should thus reject the hypothesis if |x̄ − m₀| exceeds this limit, whereas a smaller deviation should be regarded as consistent with the hypothesis.

Now in practice we usually do not know σ. By 27.3 we know, however, that the s. d. s of the sample converges in probability to σ as n → ∞. Hence for large n there will only be a small probability that s differs from σ by more than a small amount. For the purposes of our test, we may thus simply replace σ by s, and act as if we had to test the hypothesis that x̄ were normal (m₀, s/√n), where s is the known value calculated from our sample. An observed deviation |x̄ − m₀| exceeding λ_p s/√n will then lead us to reject the hypothesis m = m₀ on a p % level, while a smaller deviation will be regarded as consistent with the hypothesis.

The same method may be applied in more general cases. Consider any sample characteristic z, the distribution of which in large samples is asymptotically normal.
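Before the general case, the rule just described for the mean can be put in a few lines of Python (a minimal sketch; the function name and the 5 % default λ_p = 1.96 are our choices, not the book's):

```python
from math import sqrt

def reject_mean_hypothesis(xbar, s, n, m0, lam_p=1.96):
    """Large-sample test of m = m0: reject on the p % level when the
    deviation |xbar - m0| exceeds lam_p * s / sqrt(n), where s is the
    sample s.d. and lam_p the p % value of a normal deviate (1.96 for p = 5)."""
    return abs(xbar - m0) > lam_p * s / sqrt(n)

# Hypothetical example: xbar = 10.3, s = 2, n = 400.  The rejection limit is
# 1.96 * 2 / 20 = 0.196, so a deviation of 0.3 from m0 = 10 leads to rejection.
```

The same skeleton serves for any asymptotically normal characteristic once its standard error is known.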
In the expression for the variance of the asymptotic normal distribution of z, we replace any unknown population parameter by the corresponding known sample characteristic, retaining only the leading term of the expression for large n. The expression d(z) thus obtained will be denoted as the standard error of z in large samples. If it is required to test the hypothesis that the mean E(z) has some specified value z₀, we regard z as normally distributed with the known s. d. d(z). If the deviation |z − z₀| exceeds λ_p d(z), the hypothesis will then be rejected on the p % level, and otherwise accepted.

In this way, all expressions deduced in Chs 27–28 for the s. d:s of sample characteristics and of their asymptotic normal distributions may be transformed to standard errors. Thus e. g. by (27.2.1), (27.4.2) and (27.7.2) the standard errors of the sample mean x̄, the sample variance s² = m₂ and the sample s. d. s = √m₂ are

d(x̄) = s/√n,    d(s²) = √( (m₄ − m₂²)/n ),    d(s) = √(m₄ − m₂²) / (2s√n).

If it is assumed that the population is normal, the simpler expressions corresponding to this case may be applied. Thus e. g. by 28.5 the standard error of the median of a normal sample is s √(π/(2n)) = 1.2533 s/√n.

When a sample characteristic z has been computed, it is customary in practice to indicate its degree of reliability by writing the value z followed by ± d(z). Thus e. g. the sample mean is written x̄ ± s/√n, etc. — For the frequency ratio ν/n in n trials of an event of constant probability p, we have by (16.2.2) E(ν/n) = p and D(ν/n) = √(pq/n), so that the standard error is √( ν(n − ν)/n³ ), and consequently the frequency ratio will be written ν/n ± √( ν(n − ν)/n³ ). The corresponding percentage v% = 100 ν/n is accordingly written v% ± √( v%(100 − v%)/n ).

If two independent samples are given, the difference between their means or any other characteristics may be tested with the aid of the standard errors.
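The standard errors listed above can be computed together from the central sample moments; the following Python sketch does so (our function and key names; the normal-theory expression for the median is included for comparison):

```python
from math import sqrt, pi

def standard_errors(sample):
    """Large-sample standard errors d(xbar) = s/sqrt(n),
    d(s^2) = sqrt((m4 - m2^2)/n), d(s) = sqrt(m4 - m2^2)/(2 s sqrt(n)),
    with m2, m4 the central sample moments; the last entry is the
    normal-population standard error of the median, s*sqrt(pi/(2n))."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m4 = sum((v - mean) ** 4 for v in sample) / n
    s = sqrt(m2)
    return {
        "d(mean)": s / sqrt(n),
        "d(s2)": sqrt((m4 - m2 ** 2) / n),
        "d(s)": sqrt(m4 - m2 ** 2) / (2 * s * sqrt(n)),
        "d(median, normal)": s * sqrt(pi / (2 * n)),
    }
```

Note that d(median, normal)/d(mean) = √(π/2) = 1.2533, the factor quoted in the text.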
If the means x̄ and ȳ are regarded as normal (m₁, s₁/√n₁) and (m₂, s₂/√n₂) respectively, the difference x̄ − ȳ will be normal (m₁ − m₂, √(s₁²/n₁ + s₂²/n₂)). Any hypothesis concerning the value of the difference m₁ − m₂ can now be tested in the way shown above. In particular, the hypothesis m₁ = m₂ will be rejected on the p % level, if

|x̄ − ȳ| > λ_p √( s₁²/n₁ + s₂²/n₂ ),

and otherwise accepted.

All the above methods are valid subject to the condition that our samples are »large». There are two kinds of approximations involved, as we have supposed a) that the sampling distributions of our characteristics are normal, and b) that certain population characteristics may be replaced by the corresponding values calculated from the sample. In practice, it is often difficult to know whether our samples are so large that these approximations are valid. However, some practical rules may be given. When we are dealing with means, the approximation is usually good already for n > 30. For variances, medians, coefficients of skewness and excess, correlation coefficients in the neighbourhood of ρ = 0, etc., it is advisable to require that n should be at least about 100. For correlation coefficients considerably different from zero, even samples of 300 do not always give a satisfactory approximation.

Even in cases where n is smaller than required by these rules, or where the sampling distribution does not tend to normality, it is often possible to draw some information from the standard errors, though great caution is always to be recommended. — When the sampling distribution deviates considerably from the normal, the tables of the normal distribution do not give a satisfactory approximation to the probability of a deviation exceeding a given amount. We can then always use the inequality (15.7.2), which for any distribution gives the upper limit 1/k² for the probability of a deviation from the mean exceeding k times the s. d. However, in most cases occurring in practice this limit is unnecessarily large.
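How much too large the universal bound 1/k² is can be seen by comparing it with the exact normal tail probability; a small sketch, assuming nothing beyond the two formulas named in the comments (the function name is ours):

```python
from math import erf, sqrt

def tail_bounds(k):
    """Probability of a deviation from the mean exceeding k standard
    deviations: the universal Bienayme-Tchebycheff bound 1/k^2 of (15.7.2),
    against the exact value 2(1 - Phi(k)) for a normal variable."""
    normal_tail = 1 - erf(k / sqrt(2))   # equals 2 * (1 - Phi(k))
    return 1 / k ** 2, normal_tail

# At k = 4 the universal bound is 0.0625, while a normal variable exceeds
# four s.d.:s with probability below 0.0001.
```

This is why, for distributions known to be unimodal and moderately skew, a deviation of four standard errors can reasonably be treated as clearly significant even when the normal tables are not trusted.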
It follows, e. g., from (15.7.4) that for all unimodal and moderately skew distributions the limit may be substantially lowered. The same thing follows from the inequality given in Ex. 6, p. 256, if we assume that the coefficient occurring in that inequality is of moderate size. When there are reasons to assume that the sampling distribution belongs to one of these classes, a deviation exceeding four times the s. d. may as a rule be regarded as clearly significant. — When n is not large enough, it is advisable to use the complete expressions of the s. d:s, if these are available, and not only the leading terms. Further, we should then use the unbiased estimates (cf 27.6) of the population values, thus writing e. g. s/√(n − 1) instead of s/√n for the standard error of the mean. — Whenever possible it is, however, preferable to use in such cases the tests based on exact distributions that will be treated in the next paragraph.

31.2. Tests based on exact distributions. — When the exact sampling distributions of the relevant characteristics are known, the approximate methods of the preceding paragraph may be replaced by exact methods. As observed in 29.1, this situation arises chiefly in cases where we are sampling from normal populations.

Suppose, e. g., that we are given a sample of n from a normal population, with unknown parameters m and σ, and that it is required to test the hypothesis that m has some value m₀ given in advance. If this hypothesis is true, the sample mean x̄ is exactly normal (m₀, σ/√n), and the standardized variable (x̄ − m₀)√n/σ is normal (0, 1). The approximate method of the preceding paragraph consists in replacing the unknown σ by an estimate calculated from the sample — for small n preferably s √(n/(n − 1)) — and regarding the expression thus obtained,

t = √(n − 1) · (x̄ − m₀)/s,

as normal (0, 1). Now t is identical with the Student ratio of 29.4, and we have seen that the exact distribution of t is Student's distribution with n − 1 d. of fr.
If t_p denotes the p % value (cf 18.2) of t for n − 1 d. of fr., the probability of a deviation such that |t| > t_p is thus exactly equal to p %. The hypothetical value m₀ will thus have to be rejected on a p % level if |t| > t_p, and otherwise accepted. As n → ∞, the t-distribution approaches the normal form (cf 20.2), and the figures for this limiting case are given in the last row of Table 4. It is seen from the table that the normal distribution gives a fairly good approximation to the t-distribution when n ≥ 30. For small n, however, the probability of a large deviation from the mean is substantially greater in the t-distribution (cf Fig. 20, p. 240).

When we wish to test whether the means x̄ and ȳ of two independent normal samples are significantly different, we may set up the »null hypothesis» that the two samples are drawn from the same normal population. It has been shown in 29.4 that, if this hypothesis is true, the variable

(31.2.1)  u = (x̄ − ȳ) √( n₁n₂(n₁ + n₂ − 2)/(n₁ + n₂) ) / √( n₁s₁² + n₂s₂² )

has the t-distribution with n₁ + n₂ − 2 d. of fr. When the means and variances of the samples are given, u can be directly calculated. If |u| exceeds the p % value of t for n₁ + n₂ − 2 d. of fr., our data show a significant deviation from the null hypothesis on the p % level. If we have reason to assume that the populations are in fact normal, and that the s. d:s σ₁ and σ₂ are equal, the rejection of the null hypothesis implies that the means m₁ and m₂ are different (cf 35.5).

It is evident that we may proceed in the same way in respect of any function z of sample values, as soon as the exact distribution of z is known. We set up a probability hypothesis H, according to which an observed value of z would with great probability lie in the neighbourhood of some known quantity z₀. If the hypothesis H is true, z has a certain known distribution, and from this distribution we may find the p % value of the deviation |z − z₀|, i. e. a quantity h_p such that the probability of a deviation |z − z₀| > h_p is exactly p %.
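The two exact statistics of this paragraph translate directly into Python; a minimal sketch (function names are ours, and s denotes throughout the sample s. d. computed with divisor n, as in the text):

```python
from math import sqrt

def student_ratio(sample, m0):
    """Cramer's Student ratio t = sqrt(n - 1) * (xbar - m0) / s; under
    normality t has Student's distribution with n - 1 d. of fr."""
    n = len(sample)
    xbar = sum(sample) / n
    s = sqrt(sum((v - xbar) ** 2 for v in sample) / n)
    return sqrt(n - 1) * (xbar - m0) / s

def two_sample_u(xbar, s1, n1, ybar, s2, n2):
    """The statistic u of (31.2.1); under the null hypothesis u has the
    t-distribution with n1 + n2 - 2 d. of fr."""
    return ((xbar - ybar) * sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))
            / sqrt(n1 * s1 ** 2 + n2 * s2 ** 2))
```

With divisor n for s, the first expression agrees exactly with the usual form (x̄ − m₀)/(s'/√n) in which s' is the estimate with divisor n − 1.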
Working on a p % level, and always following the procedure of 26.2, we should then reject the hypothesis H if in an actual sample we find a deviation |z − z₀| exceeding h_p, while a smaller deviation should be regarded as consistent with the hypothesis (cf 35.1).

When we are concerned with samples drawn from normal populations, tests of significance for various parameters may thus be founded on the exact sampling distributions deduced in Ch. 29. In practice, it is very often legitimate to assume that the variables encountered in different branches of statistical work are at least approximately normal (cf 17.8). In such cases, the tests deduced for the exactly normal case will usually give a reasonable approximation. It has, in fact, been shown that the sampling distributions of various important characteristics are not seriously affected even by considerable deviations from normality in the population. In this respect, the reader may be referred to some experimental investigations by E. S. Pearson (Ref. 190), and to the dissertation of Quensel (Ref. 200) on certain sampling distributions connected with a population of Charlier's type A. It seems desirable that investigations of these types should be further extended.

31.3. Examples. — We now proceed to show some applications of tests of the types discussed in the two preceding paragraphs. We shall first consider some cases where the samples are so large that it is perfectly legitimate to use the tests based on standard errors, and then proceed to various cases of samples of small or moderate size. With respect to the significance of the deviations etc. appearing in the examples, we shall use the conventional terminology introduced in 30.2.

Ex. 1. In Table 31.3.1 we give the distribution according to sex and ages of parents of 928 570 children born in Norway during the years 1871—1900. (From Wicksell, Ref. 231.)
It is required to use these data to investigate the influence, if any, of the ages of the parents on the sex ratio of the offspring.

As a first approach to the problem, we calculate from the table the percentage of male births, and the corresponding standard error, for four large age groups, as shown by Table 31.3.2. There are no significant differences between the numbers in this table. The largest difference occurs between the numbers 51.589 and 51.111, and this difference is 0.478 ± 0.222. The observed difference is here 2.15 times its standard error, and according to our conventional terminology this is only »almost significant». Nevertheless, the table might suggest a conjecture that the excess of boys would tend to increase when the age difference x − y decreases.

In order to investigate the question more thoroughly, we consider the ages x and y of the parents of a child as an observed value of a two-dimensional random variable. Table 31.3.1 then gives the joint distributions of x and y for two samples of n₁ = 477 533 and n₂ = 451 037 values, for the boys and the girls respectively. If the

Table 31.3.1. Live born children in Norway 1871—1900.

Age of                            Age of mother y
father x    —20    20—25    25—30    30—35    35—40    40—45     45—       Total

Boys
—20         377      974      665      187       93       26        6       2 217
20—25     2 173   18 048   11 173    3 448    1 022      258       30      36 147
25—30     1 814   26 956   43 082   16 760    4 604      973      123      94 272
30—35       700   14 262   38 605   41 208   14 475    3 243      287     112 670
35—40       238    4 738   17 914   32 240   31 673    8 426      836      95 965
40—45       103    1 791    6 586   10 214   24 770   18 079    2 171      69 714
45—50        47      695    2 593    6 952   12 463   13 170    4 000      38 916
50—55        21      311      996    2 603    4 492    6 322    2 674      17 218
55—60         5      133      412      926    1 790    2 141    1 080       6 492
60—65        10       67      190      408      736      822      348       2 671
65—70         6       26       68      173      266      283      131         952
70—           2       12       46       69      119      113       48         399
Total     5 496   67 987  122 119  120 077   96 353   63 855   11 646     477 533

Girls
—20         319      861      604      206       91       22
              3     2 006
20—25     2 133   16 990   10 643    3 193      979      242       46      34 225
25—30     1 793   25 147   40 817   15 637    4 305      943       96      88 738
30—35       707   13 264   36 746   38 619   13 669    3 018      292     106 304
35—40       236    4 676   17 165   30 463   29 858    7 883      772      91 043
40—45       101    1 670    6 278   15 323   23 803   16 983    1 941      66 099
45—50        38      640    2 384    6 003   11 764   12 336    3 823      36 988
50—55        16      284      964    2 469    4 221    5 816    2 480      16 249
55—60        12      120      406      874    1 726    2 000    1 079       6 217
60—65         6       64      171      381      691      760      326       2 278
65—70         3       29       87      154      277      247      114         911
70—           1       18       30       67      108      116       40         379
Total     5 365   63 743  116 194  112 979   91 392   50 354   11 010     451 037

sex ratio among the newborn varies with the ages of the parents, the (x, y)-distribution must be different for the boys and the girls, so that the two samples are not drawn from the same population.

Table 31.3.2. Percentage of male births.

Age of              Age of mother y
father x         < 30                > 30
< 35        51.409 ± 0.090     51.589 ± 0.122
> 35        51.111 ± 0.186     51.430 ± 0.081

Table 31.3.3. Sample moments for Table 31.3.1, in units of the classbreadth (5 years).

Central            Boys                     Girls
moments      Raw       Corrected      Raw       Corrected
m₂₀        2.9127       2.8294      2.9036       2.8203
m₁₁        1.4140       1.4140      1.4085       1.4085
m₀₂        1.7966       1.7128      1.7929       1.7096
m₃₀        3.0699       3.0699      3.0391       3.0391
m₀₃        0.4588       0.4588      0.4588       0.4588
m₄₀       28.6679      27.2807     28.4536      27.0309
m₃₁       10.3627       9.9992     10.2609       9.8988
m₂₂        7.7286       7.3431      7.6970       7.3126
m₁₃        6.8110       6.4576      5.8020       5.4499
m₀₄        7.6250       6.6564      7.5260       6.6587

Table 31.3.3 shows the uncorrected moments of the two samples, and the corrected moments calculated according to (27.9.4) and (27.9.6). We first observe that the distributions deviate significantly from normality. Consider, e. g., the marginal distribution of the father's age x for the boys.
On the hypothesis that this distribution is normal, we find from the corrected moments g₁ = 0.6460 ± 0.0035 and g₂ = 0.4016 ± 0.0071, where the standard errors are calculated from (27.7.9). In both cases, the deviation from zero is highly significant, so that the hypothesis of normality is clearly disproved.¹)

¹) According to Wicksell, l. c., the distribution is approximately logarithmico-normal (cf 17.6).

Table 31.3.4. Sample characteristics for Table 31.3.1. Unit: one year.

Characteristics        Boys                Girls               10³ · Diff.
x̄               35.699 ± 0.0122     35.703 ± 0.0126          +4 ± 17.5
ȳ               32.128 ± 0.0095     32.116 ± 0.0097         −12 ± 13.6
x̄ − ȳ            3.571 ± 0.0096      3.587 ± 0.0097         +16 ± 13.6
s₁               8.410 ± 0.0094      8.397 ± 0.0097         −13 ± 13.5
s₂               6.543 ± 0.0058      6.537 ± 0.0056          −6 ± 7.6
r               0.6424 ± 0.00097    0.6414 ± 0.00101       −1.0 ± 1.40

Table 31.3.4 gives the values of some important sample characteristics for the boys and the girls, as well as the differences between corresponding characteristics for both sexes. The standard errors have been calculated according to the rules of 31.1 from the general formulae (27.2.1), (27.7.2) and (27.8.1); thus the simpler expressions (27.8.2) and (29.3.3) corresponding to the case of a normal population have not been applied here. For the difference x̄ − ȳ, we find D²(x̄ − ȳ) = (σ₁² − 2ρσ₁σ₂ + σ₂²)/n, and consequently the square of the standard error is d²(x̄ − ȳ) = (s₁² − 2rs₁s₂ + s₂²)/n. It is seen from the table that there are no significant differences between the characteristics. In particular we find that the mean of the age difference x − y is not significantly greater for the girls than for the boys, so that the conjecture suggested by Table 31.3.2 is not supported by further analysis.

Finally, we may directly apply the χ² method to test whether the two samples in Table 31.3.1 may be regarded as drawn from the same population.
In each of the two samples we have, in fact, 12 · 7 = 84 frequency groups, so that the whole table 31.3.1 may be rearranged as an 84 · 2 table of the type considered in 30.6, which may be tested for homogeneity by the χ² method, using (30.6.2) or (30.6.3) for the calculation of χ². Pooling all groups with fathers above 60, and with mothers above 40, we have a 60 · 2 table, and find χ² = 51.97 with (60 − 1)(2 − 1) = 59 d. of fr. According to Fisher's approximation (cf 20.2), √(2χ²) = 10.20 would then be an observed value of a normal variable with the mean √117 = 10.82 and unit s. d. By Table 1, the probability of obtaining a value of χ² at least as large as that actually observed is then approximately 1 − Φ(10.20 − 10.82) = 0.73, so that the agreement is very good, and the data are consistent with the hypothesis that the samples are drawn from the same population.

The analysis of the data in Table 31.3.1 has thus entirely failed to detect any significant influence of the ages of the parents on the sex of the children.

Ex. 2. In a racially homogeneous human population, the distributions of various body measurements usually agree well with the normal curve, and the small deviations are well represented by the first terms of a Charlier or Edgeworth series, as given e. g. by (17.7.5). We refer in this connection to a paper by Cramér (Ref. 70), where detailed examples are given. In such cases, the standard errors of sample characteristics may be calculated from the simplified expressions which hold for the case of a normal parent distribution. Thus by (29.3.3) the standard error of s may be put equal to s/√(2n), the standard error of the coefficient of variation V may be calculated from (27.7.11), etc.

For the stature of Swedish conscripts, measured in the years 1915—16 and 1924—25 at an average age of 19 years 8 months, we find according to Hultkrantz (Ref. 128) the sample characteristics given in Table 31.3.5.
The table shows a highly significant increase of the mean and the median during the interval of 9 years between the measurements. On the other hand, the s. d. and the coefficient

Table 31.3.5. Sample characteristics for the stature of Swedish conscripts.

Characteristics                     1915—16             1924—25           10² · Diff.
n                                    80 084              89 387
Mean, cm                        171.80 ± 0.022      172.58 ± 0.020        +78 ± 3.0
Median, cm                      171.81 ± 0.027      172.55 ± 0.026        +74 ± 3.7
s. d., cm                         6.15 ± 0.015        6.04 ± 0.014        −11 ± 2.1
Semi-interquartile range, cm      4.05 ± 0.017        4.02 ± 0.016         −3 ± 2.3
100 V = 100 s/x̄                   3.58 ± 0.0090       3.50 ± 0.0088        −8 ± 1.2

of variation show a highly significant decrease, while the decrease of the semi-interquartile range is not significant. These results agree well with further available data from Swedish conscription measurements. During the last 100 years, the mean stature of the conscripts has steadily increased, while the s. d. has decreased.

According to Table 31.3.5, the increase of the mean stature for the observed samples during the period of 9 years amounts to 0.78 ± 0.030 cm. What kind of conclusions can we draw from this fact with respect to the unknown increase δm of the population mean m? — We have, in fact, observed the value 0.78 cm of a variable which is approximately normally distributed, with the unknown mean δm, and a s. d. approximately equal to 0.030 cm. Let us, for the sake of the argument, assume that the word »approximately» may be omitted in both places, and let as usual λ_p denote the p % value of a normal deviate (cf 17.2). Consider the hypothesis that δm is equal to a given quantity c. If we are working on a p % level, this hypothesis will evidently be regarded as consistent with the data if c is situated between the limits 0.78 ± 0.030 λ_p, while otherwise it will be rejected. The quantities 0.78 ± 0.030 λ_p are called the p % confidence limits for δm, and the interval between these limits is the p % confidence interval. — We shall return to these concepts in Ch. 34.

Ex. 3. The occurrence of exceptionally high or low water levels in lakes or rivers is often of great practical importance. For the average water levels of Lake Vänern in the month of June of the n = 124 years 1807—1930, we have (data from Lindquist, Ref. 149) the mean x̄ = 4454.6 cm above sea level, and the s. d. s = 48.61 cm. The distribution agrees well with the normal curve. Grouping the original data (which are not given here) into 9 groups with the classbreadth h = 20 cm, we find χ² = 3.728. For 9 − 2 − 1 = 6 d. of fr. this gives P = 0.71, so that the fit is very good.

If we denote by x_ν the ν:th value from the top in a normal sample of n values, while y_ν is the ν:th value from the bottom, the mean and the s. d. of x_ν are given by (28.6.16), while the corresponding expressions for y_ν are obtained by obvious modifications. Replacing in these expressions the population parameters m and σ by the sample values x̄ and s given above, and neglecting the error terms, we obtain the means and standard errors given in Table 31.3.6, which also shows the extreme June levels actually observed during the period.
We observe, however, that all the lie below their means, and conversely for the pv- This is partly due to the correlation between the (and the and partly to the fact that the approximate mean values are affected with considerable errors, since we are dealing with the comparatively low value n = 124. If we may assume that the distribution will remain unaltered for a period of, say, 500 years, we obtain in the same way as above the mean 4603.6 cm, and the standard error 17.6 cm, for the upper ex- treme level during this period. It would thus seem highly improb- able that a level exceeding 4603.6 -f 4 • 17.6 = 4673.9 cm will occur during this period. Ex. 4. From Student’s classical paper (Ref. 221) on the ^-distri- bution, we quote the figures given in Table 31.3.7. It is required to test whether there is any significant difference between the effects of the drugs A and R. If we assume that the difference between the gains in sleep effected by the two drugs is normally distributed, the last column of the table constitutes a sample of n = 10 values from a normal population. On the usual null hypothesis that there is no difference between the effects, the mean of this population is zero. If this hypothesis is true, the Student ratio < = V ^ ~ s- distributed in the ^-distribution with 9 d. of fr. (cf 31.2). From the observed values, we find < = 4.06, which by Table 4 corresponds to a value of P between O.oi and O.ooi. Thus the deviation from zero is significant, and the null hypothesis is disproved. 463 31.3 Table 31.3.7. Additional hours of sleep gained by ten patients through the use of two soporific drugs A and JB. Patient Drug A X Drug D y Difference z^x — y 1 1.9 0.7 1.2 2 0.8 — 1.6 2.4 3 1.1 -0.2 l.S 4 0.1 -1.2 1.8 6 -0.1 -0.1 O.o 6 4.4 ! 
3.4 1.0 7 6.5 3.7 1.8 8 1.6 0.8 0.8 9 4.6 O.O 4.6 10 3.4 2.0 1.4 X = 2.88 «, = 1.899 y — 0.75 8, = 1.697 F = IM — 1.167 In this case, where we have the low value n = 10, it is to be ex- pected that the approximate test based on the standard error of i will not give a very accurate result. If we apply this test, and use the estimate s^/VlO — 1 for the standard error, we are led to regard the same value as above, 1^9 — Oj/sg = 4.06, as an observed value of a variable which, on the null hypothesis, is normal (0, 1). By Table 2, this corresponds to P<0.oooi. If we compare this with the value of P given by the exact test, it is seen that the error involved in applying the approximate test tends to exaggerate the significance of the deviation. If, in the experiments recorded in Table 31.3.7, two difiPerent sets of ten patients had been used to test the two drugs, the data might also have been treated in another way (cf R. A. Fisher, Ref. 13, p. 123 — 125). Suppose that for each drug the gain in sleep is normally distributed, the s. d. having the same value in both cases. The samples headed x and t/ are then independent samples from normal popula- tions with the same a, and it is required to test the null hypothesis that the two population means and Wg are equal. The variable u defined by (31.2.1), where we have to take = w* = 10, then has the 464 31.3 ^distribution with 18 d. of fr., and from Table 31.3.7 we find «==1.8f,, which corresponds to P = 0.08, so that in this way we do not find any significant difference between the effects. In cases where we may assume that the x and y columns are in- dependent, both the above methods are available, and if either test shows a clearly significant difference, we must regard the null hypo- thesis as disproved, even if the other test fails to detect any signifi- cant difference. 
— In the case actually before us in Table 31.3.7 there is, however, an obvious correlation between the x and y columns due to the fact that corresponding figures refer to the same patient, so that it is not legitimate to apply the second method.

Ex. 5. For the July temperatures in Stockholm for the n = 100 years 1841—1940, we have (cf Table 30.4.2) the mean x̄ = 16.982 and the s. d. s = 1.616. For the 30 first and the 30 last years of the period, the means are respectively 16.898 and 17.468. Are these group means significantly different from the general mean 16.982? From the expression (29.4.5), we obtain t = −0.86 for the 30 first years, and t = 1.97 for the 30 last years, in both cases with

Fig. 31. Prices of potatoes at 46 places in Sweden, December 1936 (x), and December 1937 (y), with the two regression lines and the orthogonal regression line.
If we are working on a p % level, this hypothesis will be regarded as consistent with the data if c is situated between the limits

b₂₁ ± (t_p/√(n − 2)) · (s₂/s₁) · √(1 − r²),

where t_p denotes the p % value of t for n − 2 d. of fr., while otherwise the hypothesis will be rejected. These limits are the p % confidence limits for β₂₁ (cf above). In the actual case we obtain in this way the following confidence limits for β₂₁:

p = 5 %: 0.687 and 1.107,
p = 1 %: 0.617 and 1.177,
p = 0.1 %: 0.530 and 1.264.

For the sample correlation coefficient r = 0.7928, we have by (27.8.1) and (27.8.2) approximately the mean ρ and the standard error d(r) = (1 − r²)/√n = 0.0548. If the sampling distribution of r shows a sufficiently close approach to normality, this may be used to test the hypothesis that ρ is equal to any given quantity. However, the sampling distribution of r tends rather slowly to normality when ρ differs considerably from zero, and for n = 46 it must be expected that the results obtained by the use of the standard error are not very accurate. It is thus preferable to use the exact tables of the r-distribution (David, Ref. 261) or the logarithmic transformation (29.7.3)–(29.7.4) due to R. A. Fisher. In the latter case, we have to regard

z = ½ log((1 + r)/(1 − r))

as normally distributed, with the mean ½ log((1 + ρ)/(1 − ρ)) + ρ/(2(n − 1)) and the s. d. 1/√(n − 3), so that the variable

λ = √(n − 3) · (z − ½ log((1 + ρ)/(1 − ρ)) − ρ/(2(n − 1)))

is normal (0, 1). Working on a p % level, we are thus led to regard the data as consistent with any hypothetical value of ρ, if ½ log((1 + ρ)/(1 − ρ)) + ρ/(2(n − 1)) falls between the limits

½ log((1 + r)/(1 − r)) ± λ_p/√(n − 3),

where λ_p is the p % value of a normal deviate, while otherwise the hypothetical value will be rejected. When r is known, these limits may be calculated for any p, and the corresponding values of ρ are then obtained by the numerical solution of an equation of the form

½ log((1 + ρ)/(1 − ρ)) + ρ/(2(n − 1)) = ½ log((1 + r)/(1 − r)) ± λ_p/√(n − 3).

These values are the p % confidence limits for ρ.
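Both procedures can be sketched numerically. This is our own illustration, not the book's computation; the values λ_p = 1.960 and t_p = 2.015 are the usual two-sided 5 % percentage points, taken as given:

```python
import math

n, r = 46, 0.7928
s1, s2, b21 = 106.80, 120.91, 0.8971

# 5 % confidence limits for the regression coefficient beta_21
t_p = 2.015                                    # 5 % value of t for 44 d. of fr.
half = t_p / math.sqrt(n - 2) * (s2 / s1) * math.sqrt(1 - r * r)
beta_limits = (b21 - half, b21 + half)

# 5 % confidence limits for rho via the logarithmic transformation:
# solve atanh(rho) + rho/(2(n-1)) = atanh(r) -/+ lambda_p/sqrt(n-3) by bisection
lam = 1.960

def solve(target):
    """Invert the strictly increasing function atanh(rho) + rho/(2(n-1))."""
    lo, hi = -0.999999, 0.999999
    for _ in range(100):
        mid = (lo + hi) / 2
        if math.atanh(mid) + mid / (2 * (n - 1)) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = math.atanh(r)
rho_limits = (solve(z - lam / math.sqrt(n - 3)),
              solve(z + lam / math.sqrt(n - 3)))
```

With these inputs the β₂₁ limits come out near 0.687 and 1.107, and the upper ρ limit close to 0.878, in line with the figures quoted in the text.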
In the actual case, we obtain the following confidence limits for ρ:

p = 5 %: 0.6486 and 0.8783,
p = 1 %: 0.5913 and 0.8980,
p = 0.1 %: 0.5164 and 0.9171.

Ex. 7. Table 31.3.8 gives the values (taken from official records) for the n = 30 years 1913–1942 of the following four variables:

x₁ = average yield of wheat (autumn sown) in kg per 10⁴ m² for 20 rural parishes in the district of Kalmar (Sweden).

Table 31.3.8. Wheat yield, temperature and rainfall in the Kalmar district.

Year | Wheat yield x₁ | Winter temperature x₂ | Summer temperature x₃ | Rainfall x₄ | Best linear estimate of x₁
1913 | 1990 | 2.7 | 12.8 | 230 | 2126
1914 | 1960 | 3.1 | 13.7 | 268 | 2295
1915 | 1630 | 1.9 | 12.0 | 188 | 1899
1916 | 1720 | 1.8 | 11.7 | 316 | 2068
1917 | 1660 | 1.0 | 12.7 | 180 | 1794
1918 | 1680 | 1.6 | 12.0 | 261 | 2004
1919 | 1980 | 2.8 | 12.2 | 216 | 2017
1920 | 2180 | 1.7 | 12.8 | 346 | 2223
1921 | 2370 | 3.1 | 13.1 | 131 | 1995
1922 | 1790 | 1.1 | 11.8 | 256 | 1918
1923 | 2400 | 1.6 | 11.2 | 327 | 2100
1924 | 1410 | 0.1 | 11.8 | 320 | 1913
1925 | 2670 | 3.7 | 13.2 | 382 | 2680
1926 | 2180 | 1.1 | 12.6 | 279 | 1996
1927 | 2160 | 2.6 | 12.2 | 361 | 2313
1928 | 2530 | 0.8 | 10.5 | 324 | 1966
1929 | 2100 | 0.8 | 10.9 | 196 | 1718
1930 | 2330 | 3.6 | 12.4 | 381 | 2529
1931 | 1860 | 1.6 | 10.7 | 273 | 1970
1932 | 2230 | 1.9 | 12.6 | 289 | 2123
1933 | 2610 | 2.2 | 11.9 | 338 | 2234
1934 | 2600 | 3.0 | 13.6 | 267 | 2271
1935 | 2480 | 3.2 | 12.8 | 372 | 2453
1936 | 1940 | ?.8 | 12.8 | 367 | 2370
1937 | 2770 | 2.1 | 13.5 | 358 | 2332
1938 | 2670 | 3.8 | 12.9 | — | 2154
1939 | 2610 | 3.8 | 13.4 | 311 | 2461
1940 | 1420 | −1.3 | 11.8 | 172 | 1434
1941 | 810 | −0.4 | 11.8 | 194 | 1672
1942 | 1990 | −2.4 | 11.2 | 261 | 1434

x₂ = mean Celsius temperature of the air at Kalmar during the preceding winter (October–March).

x₃ = mean Celsius temperature of the air at Kalmar during the actual vegetation period (April–September).

x₄ = total rainfall in mm during the vegetation period, average for three meteorological stations in the district.

In this case it seems reasonable to regard the variables x₂, x₃ and x₄ as causes, each of which contributes more or less to the value of the yield x₁.
It is required to investigate the nature of the causal relations between the variables. When the data are so few as in this example, we cannot hope to reach very precise results, but have to be satisfied with some general indications with respect to the significance or non-significance of the various possible influences. We shall assume that the joint distribution of the four variables is normal.

The correlation matrix R = {r_ij} of the sample is

  1        0.59107  0.41082  0.46120
  0.59107  1        0.67028  0.31888
  0.41082  0.67028  1        0.10720
  0.46120  0.31888  0.10720  1

The determinant R = |r_ij| is the square of the scatter coefficient (cf 22.7) of the sample. If the x_i are independent, we have by (29.13.2) E(R) = 0.806 and D(R) approximately = 0.116. From the above matrix, we actually find R = 0.273, so that a dependence between the variables is clearly indicated.

The significance of the various r_ij may be judged by means of the distribution (29.7.5), which holds for r_ij if x_i and x_j are independent. According to (29.7.6), the hypothesis that x_i and x_j are independent will be disproved on the p % level, if |r_ij| exceeds the limit t_p/√(t_p² + ν), where t_p is the p % value of t for ν = n − 2 d. of fr. A table of this limit for various values of n and p is given by Fisher and Yates (Ref. 262). For the usual 5 %, 1 % and 0.1 % levels, the values of the limit are

D. of fr. | p = 5 % | p = 1 % | p = 0.1 %
ν = 26 | 0.3740 | 0.4786 | 0.5880
ν = 27 | 0.3673 | 0.4706 | 0.5790
ν = 28 | 0.3609 | 0.4629 | 0.5703

For our r_ij we have ν = n − 2 = 28 d. of fr., so that all except r₂₄ and r₃₄ exceed the 5 % limit. r₁₃ lies between the 5 % and 1 % limits, and r₁₄ is almost equal to the 1 % limit, while r₁₂ and r₂₃ even exceed the 0.1 % limit. It is interesting to note that r₁₂ is considerably larger than r₁₃, which seems to indicate that the temperature of the last winter has a greater influence on the yield than the temperature of the summer.
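The limit t_p/√(t_p² + ν) is easy to reproduce. A small sketch of our own; the t-values are the usual two-sided percentage points for ν = 28, supplied here by hand:

```python
import math

def r_limit(t_p, nu):
    """Significance limit for a sample correlation coefficient:
    |r| is significant on the level belonging to t_p when it exceeds this."""
    return t_p / math.sqrt(t_p * t_p + nu)

# nu = 28, with the 5 % and 0.1 % values of Student's t
lim_5 = r_limit(2.048, 28)     # near 0.3609
lim_01 = r_limit(3.674, 28)    # near 0.5703
```

The limit decreases as the degrees of freedom grow, as the tabulated rows for ν = 26, 27, 28 show.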
The partial correlation coefficients may be calculated from (23.4.3), and we find the following values:

r₁₂.₃ = 0.4666, r₁₃.₂ = 0.0244, r₁₄.₂ = 0.3567,
r₁₂.₄ = 0.5281, r₁₃.₄ = 0.4096, r₁₄.₃ = 0.4602.

For the significance limits of the r_ij.k, we have by (29.13.5) an expression of the same form as for the r_ij, with ν = n − 3 = 27 d. of fr. Among the six coefficients given above, it is thus only r₁₂.₄ that exceeds the 1 % limit, though both r₁₂.₃ and r₁₄.₃ lie very close to this value. If we compare e. g. r₁₃ = 0.41082 with the values given for r₁₃.₂ and r₁₃.₄, we find that the elimination of the influence of the winter temperature x₂ has reduced the correlation between the yield x₁ and the summer temperature x₃ to the completely insignificant value r₁₃.₂ = 0.0244, while the elimination of the rainfall x₄ has practically no effect on the correlation. On the other hand, the comparison between r₁₂ = 0.59107 and r₁₂.₃ or r₁₂.₄ shows that the correlation between yield and winter temperature is not substantially reduced by the elimination of summer temperature or rainfall. With respect to r₁₄ the situation is much the same as for r₁₂. — These comparisons seem to suggest the conjecture that the winter temperature x₂ and the rainfall x₄ are the really important factors, while the influence of the summer temperature x₃ is mainly due to the fact that x₃ is rather strongly correlated with x₂ (r₂₃ = 0.67028).

The partial correlation coefficients with two secondary subscripts are calculated from (23.4.4). We find

r₁₂.₃₄ = 0.3736, r₁₃.₂₄ = 0.0848, r₁₄.₂₃ = 0.3650,

and these values seem to support the above conjecture, though none of them is strictly significant. We have here ν = n − 4 = 26 d. of fr., and the 5 % significance limit for r_ij.kl is 0.3740.

Consider now the multiple correlation coefficients. By means of (23.5.3) we find
r₁₍₂₃₎ = 0.5914, r₁₍₂₄₎ = 0.6574, r₁₍₃₄₎ = 0.5872, r₁₍₂₃₄₎ = 0.6606.

The comparison between r₁₂ = 0.5911 and r₁₍₂₃₎ = 0.5914 confirms the results already obtained, since it shows that the knowledge of x₃ adds practically nothing to our information with respect to the yield x₁, when we already know x₂. Similarly, the multiple correlation coefficient r₁₍₂₄₎ is not appreciably smaller than r₁₍₂₃₄₎.

If the variables are independent, the product n·r²₁₍₂...k₎ is by (29.13.9) for large n approximately distributed in a χ²-distribution with k − 1 d. of fr. In the actual case, we find n·r²₁₍₃₄₎ = 10.341 with 2 d. of fr., and n·r²₁₍₂₃₄₎ = 13.092 with 3 d. of fr. Since r₁₍₂₃₎ and r₁₍₂₄₎ are both greater than r₁₍₃₄₎, it is thus seen that all four multiple correlation coefficients given above are significantly greater than zero.

Finally, we find the partial regression coefficients

b₁₂.₃₄ = 133.65, corresponding to t = 2.066,
b₁₃.₂₄ = 44.87, corresponding to t = 0.434,
b₁₄.₂₃ = 1.9963, corresponding to t = 1.999,

where the t-values are calculated from (29.12.1), under the hypothesis that the corresponding population values are zero. We have 26 d. of fr. for t, and thus by Table 4 none of the three values is significant, though the t-value for b₁₂.₃₄ lies near the 5 % limit.

If we identify the observed b-values with the unknown population values, this would mean e. g. that an increase of one degree in the mean winter temperature would on the average produce an increase of about 134 kg in the yield per 10⁴ m², summer temperature and rainfall being equal, whereas the corresponding figure for an increase of one degree in the summer temperature would only amount to 45 kg.

The equation of the sample regression plane for x₁ gives the best linear estimate of the observed values of x₁ in terms of x₂, x₃ and x₄:

x₁* = 133.65 x₂ + 44.87 x₃ + 1.9963 x₄ + 730.9.

The values of x₁* calculated from this expression are given in the last column of Table 31.3.8. The values of x₁ and x₁* are also shown in Fig. 32.

Fig. 32. Wheat yield x₁ and its best linear estimate x₁*.
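These computations can be sketched as follows: the first-order partial correlation by the standard recursion of the form (23.4.3), and the regression plane applied to the 1913 row of Table 31.3.8. The function names are ours, not the book's:

```python
import math

def partial_r(r_ij, r_ik, r_jk):
    """First-order partial correlation r_ij.k."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

r12, r13, r23 = 0.59107, 0.41082, 0.67028
r13_2 = partial_r(r13, r12, r23)     # yield vs. summer temperature,
                                     # winter temperature eliminated

def estimate(x2, x3, x4):
    """Best linear estimate of the wheat yield x1 (sample regression plane)."""
    return 133.65 * x2 + 44.87 * x3 + 1.9963 * x4 + 730.9

est_1913 = estimate(2.7, 12.8, 230)  # table prints 2126
```

The computed r₁₃.₂ reproduces the near-zero value quoted in the text, and the plane reproduces the tabulated estimates to within rounding.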
It should be borne in mind that, in all tests treated above, we have throughout assumed that we are concerned with samples obtained by simple random sampling (cf 25.2). This implies, i. a., that the sample values are supposed to be mutually independent. In many applications, however, situations arise where this assumption cannot be legitimately introduced. Cases of this character occur, e. g., often in connection with the analysis of statistical time series. Unfortunately, considerations of space have prevented the realization of the original plan to include in the present work a chapter on this subject, based on the mathematical theory of random processes. A discussion of the subject will be found in the dissertation of Wold (Ref. 246 a).

Chapters 32–34. Theory of Estimation.¹)

CHAPTER 32. Classification of Estimates.

32.1. The problem. — In the preceding chapters, we have repeatedly encountered the problem of estimating certain population parameters by means of a set of sample values. We now proceed to a more systematic investigation of this subject.

The theory of estimation was founded by R. A. Fisher in a series of fundamental papers (Ref. 89, 96, 103, 104 and others). In Chs 32–33, we shall give an account of some of the main ideas introduced by Fisher, completing his results on certain points. In the present chapter, we shall be concerned with the classification and properties of various kinds of estimates. We shall then in Ch. 33 turn to consider some general methods of estimation, particularly the important method of maximum likelihood due to R. A. Fisher. Finally, Ch. 34 will be devoted to an investigation of the possibility of using the estimates for drawing valid inferences with respect to the parameter values.

Suppose that we are given a sample from a population, the distribution of which has a known mathematical form, but involves a certain number of unknown parameters.
There will then always be an infinite number of functions of the sample values that might be proposed as estimates of the parameters. The following question then arises: How should we best use the data to form estimates? This question immediately raises another: What do we mean by the »best» estimate?

We might be tempted to answer that, evidently, the best estimate is the estimate falling nearest to the true value of the parameter to be estimated. However, it must be borne in mind that every estimate is a function of the sample values, and is thus to be regarded as an observed value of a certain random variable. Consequently we have no means of predicting the individual value assumed by the estimate in a given particular case, so that the goodness of an estimate cannot be judged from individual values, but only from the distribution of the values which it will assume in the long run, i. e. from its sampling distribution. When the great bulk of the mass in this distribution is concentrated in some small neighbourhood of the true value, there is a great probability that the estimate will only differ from the true value by a small quantity. From this point of view, an estimate will be »better» in the same measure as its sampling distribution shows a greater concentration about the true value, and the above question may be expressed in the following more precise form: How should we use our data in order to obtain estimates of maximum concentration? — We shall take this question as the starting-point of our investigation.

¹) A considerable part of the topics treated in these chapters are highly controversial, and the relative merits of the various concepts and methods discussed here are subject to divided opinions in the literature.
We have seen in Part II that the concentration (or the complementary property: the dispersion) of a distribution may be measured in various ways, and that the choice between various measures is to a great extent arbitrary. The same arbitrariness will, of course, appear in the choice between various estimates. Any measure of dispersion corresponds to a definition of the »best» estimate, viz. the estimate that renders the dispersion as expressed by this particular measure as small as possible.

In the sequel, we shall exclusively consider the measures of dispersion and concentration associated with the variance and its multidimensional generalizations. This choice is in the first place based on the general arguments in favour of the least-squares principle advanced in 15.6. Further, in the important case when the sampling distributions of our estimates are at least approximately normal, any reasonable measure of concentration will be determined by the second order moments, so that in this particular case the choice will be unique. — For a discussion of the theory from certain other points of view, the reader may be referred to papers by Pitman (Ref. 198, 199) and Geary (Ref. 116 a).

It will be convenient to consider first the case of samples from a population, the distribution of which contains a single unknown parameter. This case will be treated in 32.2–32.5, while 32.6–32.7 will be devoted to questions involving several unknown parameters. An important generalization of the theory will be indicated in 32.8.

32.2. Two lemmas. — We shall now prove two lemmas that will be required in the sequel. Each lemma is concerned with one of the
two simple types of distributions, and there is a general proposition of which both lemmas are particular cases. The general proposition will, however, not be given here.

Lemma 1. Suppose that, for every α belonging to a non-degenerate interval A, the function g(x; α) is a fr. f. in x, having the first moment ψ(α) and a finite second moment. Suppose further that, for almost all x, the partial derivative ∂g/∂α exists for every α in A, and that

|∂g/∂α| < G₀(x), |x · ∂g/∂α| < G₁(x),

where G₀ and G₁ are integrable over (−∞, ∞). — Then the derivative ψ'(α) = dψ/dα exists for every α in A, and we have

(32.2.1) ∫ (x − α)² g(x; α) dx ≥ (ψ'(α))² / ∫ (∂ log g/∂α)² g(x; α) dx,

both integrals being extended over (−∞, ∞). The sign of equality holds here, for a given value of α, when and only when there exists a quantity k, which is independent of x but may depend on α, such that

(32.2.2) ∂ log g/∂α = k(x − α)

for almost all x satisfying g(x; α) > 0.

By hypothesis we have for every α in A

(32.2.3) ∫ g(x; α) dx = 1, ∫ x g(x; α) dx = ψ(α),

and the conditions of 7.3 for differentiation under the integral sign are satisfied for both integrals, so that ψ'(α) exists and is given by the expression

ψ'(α) = ∫ x (∂g/∂α) dx = ∫ (x − α)(∂g/∂α) dx = ∫ (x − α)(∂ log g/∂α) g dx,

since ∫ (∂g/∂α) dx = 0.¹)

The relation (32.2.1) then immediately follows by an application of the Schwarz inequality (9.5.1).²) In (9.5.1) the sign of equality holds when and only when there are two constants u and v, not both equal to zero, such that u·g(x) + v·h(x) = 0 for almost all (P) values of x. Since (x − α)√g cannot vanish for almost all x it follows that, for a given value of α, the sign of equality holds in (32.2.1) when and only when ∂ log g/∂α = k(x − α) for almost all x, where k is independent of x. This completes the proof of the lemma.

¹) If g(x; α) = 0 for all x in a certain interval, we must there also have ∂g/∂α = 0, as otherwise g would assume negative values. The expression g(∂ log g/∂α)² may then be given the value zero.

We give two examples of cases where the relation (32.2.2) is satisfied. Accordingly, it will be easily verified that in both these cases the sign of equality holds in (32.2.1).

Ex. 1. The normal distribution with mean α and constant s. d.
Taking

g(x; α) = (1/(σ√(2π))) e^(−(x−α)²/(2σ²)),

where σ is independent of x and α, we have ψ(α) = α and ∂ log g/∂α = (x − α)/σ² for all x and α.

Ex. 2. The χ²-distribution. By (18.1.6), the fr. f. k_n(x) of the χ²-distribution has the first moment n. Thus the fr. f. g(x; α) = (n/α) k_n(nx/α), where α > 0, has the first moment ψ(α) = α, and we obtain from (18.1.3)

∂ log g/∂α = (n/(2α²))(x − α) for all x > 0 and α > 0.

Lemma 2. Suppose that, for every α belonging to a non-degenerate interval A, the finite or enumerable sequence of functions p₁(α), p₂(α), . . . are the probabilities of a distribution of the discrete type, the corresponding mass points u₁, u₂, . . . being independent of α. Suppose further that the distribution has the first moment ψ(α) and a finite second moment, and that the derivatives p'_i(α) exist for all i and for every α in A, and are such that the series Σ u_i p'_i(α) converges absolutely and uniformly in A. — Then the derivative ψ'(α) exists for every α in A, and we have

(32.2.4) Σ_i (u_i − α)² p_i(α) ≥ (ψ'(α))² / Σ_i (d log p_i/dα)² p_i(α).

The sign of equality holds here, for a given value of α, when and only when there exists a quantity k, which is independent of i but may depend on α, such that

(32.2.5) d log p_i/dα = k(u_i − α)

for all i satisfying p_i(α) > 0.

This is strictly analogous to Lemma 1, and is proved in the same way, by means of the following relations which correspond to (32.2.3):

Σ_i p_i(α) = 1, Σ_i u_i p_i(α) = ψ(α).

As in the previous case, we give two examples of cases where the relation (32.2.5) is satisfied; in both cases it will be easily verified that the sign of equality holds in (32.2.4).

Ex. 3. For the binomial distribution with p = α/n, we have u_i = i and p_i = (n over i)(α/n)^i (1 − α/n)^(n−i), where i = 0, 1, . . ., n. Hence the mean is ψ(α) = np = α, and we have

d log p_i/dα = i/α − (n − i)/(n − α) = (n/(α(n − α)))(u_i − α).

Ex. 4. When n → ∞ while α remains fixed, the binomial distribution tends to the Poisson distribution with u_i = i and p_i = e^(−α) α^i/i!. Here we have ψ(α) = α and

d log p_i/dα = i/α − 1 = (u_i − α)/α.

²) I am indebted to Professor L. Ahlfors for a remark leading to a simplification of my original proof of (32.2.1).

32.3. Minimum variance of an estimate. Efficient estimates. — Suppose that, to every value of the parameter α belonging to a non-degenerate interval A, there corresponds a certain d. f. F(x; α). Let x₁, . . ., xₙ be a sample of n values from a population with the d. f. F(x; α), where α may have any value in A, and let it be required to estimate the unknown »true value» of α. We shall use the general notation α* = α*(x₁, . . ., xₙ) for any function of the sample values¹) proposed as an estimate of α.

In the paragraphs 32.3–32.4, the size n of the sample will be considered as a fixed number ≥ 1. In 32.5, we proceed to consider
When n -* co while a remains fixed, the binomial distribution tends to the Poisson distribution with w,- = / and ^ Here we have yf[a''=^(K and d log Pj. — a da a 32.3. Minimum variance of an estimate. Efficient estimates. — Suppose that, to every value of the parameter a beloiig^ing^ to a non- degcenerate interval there corresponds a certain d. f. F{x] a). Let .r,, . . Xn be a sample of n values from a population with the d. f. F{x\ a), where a may have any value in A, and let it be required to estimate the unknown »true value » of a. We shall use the general notation a* — a* (.r^, . . ., Xn) for any function of the sample values') proposed as an estimate of «. In the paragraphs 32.3 — 32.4, the size 7? of the sample will be considered as a fixed number ^1. In 32.5, we proceed to consider M It is important to observe the different signification of the symbols a" ana a. 15y definition, a* is a function of the sample values .fi, . . ., which are conceived as random variables. Thus a* is itself a random varinblSf possessing a certain sampling distribution. On the other hand, cc is a variable in the ordinary analytic sense which, in the population corresponding to a given sample, may assume any constant, though possibly unknown, value in A. 477 32.3 questions related to the asymptotic behaviour of our estimates when 11 is large. According to the terminology introduced in 27.6, a* is called an unbiased estimate of a, if we have E(a*) = a. As shown by some simple examples in 27.6, it is often possible to remove the bias of an estimate by applying a simple correction, so that an unbiased estimate is obtained. In the general case, however, an estimate will have a certain bias b(a) depending on or, so that we have £(«*)="« + It can be shonn that, subject to certain (fcneral conditions of rerjU' larity, the mean square deviation E(a* — aY can never fait betoiv a po- sitive limit depending only on the d. f. \ or), the size n of the sample, and the bias b{a). 
In the particular case when is unbiased ivhatever he the true value of a in A, the bias b[tt) is identically zero, and it follows that the variance D^(a*) can never fall below a certain limit depending oidy o)i F and ii. We shall restrict ourselves to proving this theorem for the case when the d. f. F(x] a) belongs to one of the two simple types. 1. The continuous type. — Consider a distribution of the continuous type, with the fr. f . f[x\ a), where a may have any value in A. The values Xn obtained in n independent drawings from this distri- bution are independent random variables, all of which have the same fr. f. f{x] a). Each particular sample will be represented by a definite point X ■■= (Xi, . . ., Xn) ill the sample space Rn of the variables . . ., .r?,, and the probability element of the joint distribution is L(Xi, . . Xn] a)dx^ . . . dXn =f{x^] a) . . . f(xn] a) dx^ . . . dXn. The joint fr. f. Z/=/(x, ; a) . . , f{xn] a) is known as the likelihood function of the sample (cf 33.2). Let now a* — a* (xi, . . ., Xn) be a unique function of x^, . . ., Xn not depending on a, which 'is continuous and has continuous partial derivatives -- — in all points x, except possibly in certain points be- O X\ longing to a finite number of hypersurfaces. We propose to use a* as an estimate of a, and suppose that E(a*) = a + b(a), so that b(ci) is the bias of a*. The equation a* = c will, for various values of c, define a family of hypersurfaces in Rn. and a point in Rn may be uniquely deter- 478 32.3 mined by the value of a* corresponding* to the particular hypersur- face to which the point belongs, and by n — 1 local » coordinates 51 1 • • -1 which determine the position of tlie point on the hypersur- face. We may now consider the transformation by wliich the old vari- ables Xn are replaced by the new variables tr* and C^hoosing the » local » coordinates such that the transformation satisfies the conditions A) and B) of 22:1, the joint fr. f. 
ol the now variables will then be /Ur, «) - . ./U»/; where J is the Jacobian of the transformation, and the have to be replaced by their expressions in terms of the new variables. The random variable c* will have a certain distribution, in general dependent on the parameter «, and we denote the corresi)onding fr. f. by g[fx*\ a). Further, the joint conditional distributioji of corresponding to a given value of will have a fr. f. wliich we denote by /i(5i, ■ • •, Bj (22.1.1) we then have (32.3.1) /(x,; «) . . .f{x„\ a)\J\^ a)h(i ;„-i \u^; «), and the transformation of the probability element aceordingr to (22.2.3)- may thus be written (82.3.2) f(xi \ a) . , ,/(.r„; . . . (/xu = ~ g (a* \ . . , 5»-j \(c*\ (^fhc" the Suppose now that, for almost all values of .r, a*, i',, . . partial derivatives and exist for every a in .1, and that ' () (X 0 a d a 0_g da < ^A)(«*), dh da <i/,(,s, ..., ^Vi, where Gq, a* Gq and are integrable over the whole space ol the variables x, a*, a* and . . . , respectively. We shall then say that we are concerned with a regular eshwation vase of the co}itinuous type, and or* will be called a regular esiime^te of a. We now pro- ceed to prove the following main theorem. In any regular estimation case of the continuous type, the mean square deviation of the estimate a* from the true value a satisfies the inequality 479 32.3 (32.3.3) E(a* -aY Vr (IhV da) a) d.r The sign of equality holds here, for every a in A, when and only when the following two conditions are satisfied whenever r/ (or* ; a) > 0 : A) The fr,f. h(^i, . . a) is independent of a. B) We have = A(a* — or), where is independent of a* hut may depend on a. In the particidar case when or* is unbiased tvhatever he the value a in A, we have h (a) = 0, and (32.3.3) reduces to (32.3.3 a) of From our assumptions concerning the functions / and h, it follows according to 7,3 that the relations oo oo OO f/{x; a)dx — f ■ f /t(|„ . . 
|a*; = 1 — 00 — 00 — 00 may be differentiated with respect to a under the integrals. The re- sulting relations may be written (32.3.4) ajd^i . . . = 0. Taking the logarithmic derivatives with respect to or on both sides of (32.3.1) we obtain, the Jacobian being independent of or, (32.3.5) d log /(xr, a) da 0 log g 0 log h da ^ da We now square both members of this relation, multiply by (32.3.2), 480 32.3 and integrate over the whole space. According to (32.3.4) all terms involving products of two different derivatives vanish, and we obtain (32.3.6) oo oo — oo — oo — OO The above proof of this inequality is due to Dugue (Ref. 76). The sign of equality holds here when and only when - - = 0 in almost o a all points such that r/ > 0, i. e. when the condition A) is satisfied. Finally, the fr. f . ff(a*;a) satisfies the conditions of Lemma 1 of the preceding paragraph, with tp(a) = a + b (a), and an application of that lemma to the inequality (32.3.6) now immediately completes the proof of the theorem. The integral occurring in the denominators of the second members of (32.3.3) and (32.3.3 a) may be expressed in any of the equivalent forms E d'jr. It will be readily seen that the above theorem remains true when we consider samples from a multidimensional population, specified by a fr. f. f{x^, . . Xk\ a) containing the unknown parameter or. Consider now the case when the estimate a* is regular and un- biased. The second member of (32.3.3 a) then represents the smallest possible value of the variance (a*). The ratio between this minimum value and the actual value of D^(a*) will be called the efficiency of and will be denoted by e(a*). We then always have 0 ^ e(a*) ^ 1. When the sign of equality holds in (32.3.3 a), the variance D*(a*) attains its smallest possible value, and we have e(a*) = l. In this case we shall say that a* is an efficient estiniate^). These concepts are due to R. A. Fisher (Ref. 89, 96). 
¹) As a rule this term is used with reference to the behaviour of an estimate in large samples, i. e. for infinitely increasing values of n. However, we shall here find it convenient to distinguish between an efficient estimate, by which we mean an estimate of minimum variance for a given finite size n of the sample, and an asymptotically efficient estimate (cf 32.6), which has the analogous property for samples of infinitely increasing size. An efficient estimate exists only under rather restrictive conditions (cf 32.4), whereas the existence of an asymptotically efficient estimate can be proved as soon as certain general regularity conditions are satisfied (cf 33.3).

It follows from the above theorem that a regular and unbiased estimate is efficient, when and only when the conditions A) and B) are satisfied. This becomes evident, if e(α*) is written in the form

e(α*) = [∫ (∂ log g/∂α)² g dα* / (n ∫ (∂ log f/∂α)² f dx)] · [1 / (D²(α*) ∫ (∂ log g/∂α)² g dα*)].

Both factors in the last expression are ≤ 1, and the efficiency attains its maximum value 1 when and only when both factors are = 1. The first factor is = 1 when and only when the condition A) of the above theorem is satisfied, while the second factor has the same relation to condition B). — When an efficient estimate exists, it can always be found by the method of maximum likelihood due to R. A. Fisher (cf 33.2).

Let now α₁* be an efficient estimate, while α₂* is any regular unbiased estimate of efficiency e > 0. We shall show that the correlation coefficient of α₁* and α₂* is ρ = √e. In fact, the regular unbiased estimate α* = (1 − k)α₁* + kα₂* has the variance

D²(α*) = ((1 − k)² + 2k(1 − k)ρ/√e + k²/e) D²(α₁*) = (1 + 2k(ρ/√e − 1) + k²(1 − 2ρ/√e + 1/e)) D²(α₁*),

and if ρ ≠ √e, the coefficient of D²(α₁*) can always be rendered < 1 by giving k a sufficiently small positive or negative value. Then it would follow that D²(α*) < D²(α₁*), and the efficiency of α* would be > 1, which is impossible. In particular for e = 1 we have ρ = 1. Thus two efficient estimates α₁* and α₂* have the same mean α, the same variance, and the correlation coefficient ρ = 1. It then follows from 21.7 that the total mass in the joint distribution of α₁* and α₂* is situated on the line α₁* = α₂*, so that two efficient estimates of the same parameter are »almost always» equal.

We show in this paragraph several examples of efficient estimates (Ex. 1–4 for the continuous case, Ex. 5–6 for the discrete case). It will be left to the reader to verify that, in each case, the conditions A) and B) for efficient estimates are satisfied. In order to do this — we talk here of the continuous case, but in the discrete case everything is analogous — he will first have to find the fr. f. g(α*; α) of the estimate concerned, and then the examples given in 32.2 will directly provide the verification of condition B). Further, a convenient set of auxiliary variables ξ₁, . . ., ξₙ₋₁ should be introduced, and the conditional fr. f. h should be calculated from (32.3.1); it then only remains to verify that h is independent of α. — In all examples, except in Ex. 4, we are dealing with regular estimates only. The reader should verify this in detail at least in some cases.

Ex. 1. The mean of a normal population. Writing

f(x; m) = (1/(σ√(2π))) e^(−(x−m)²/(2σ²)),

where α = m is the parameter to be estimated, while σ is a known constant, we may choose for A any finite interval, and obtain

E((∂ log f/∂m)²) = 1/σ².

Consequently the variance of any regular unbiased estimate m* satisfies the inequality D²(m*) ≥ σ²/n. For the particular estimate m* = x̄ = (1/n) Σ x_i we have by 27.2 E(x̄) = m and D²(x̄) = σ²/n, so that the mean x̄ is an efficient estimate of m. Accordingly we have seen above that certain other possible estimates of m, such as the sample median (cf 28.6), and the mean of the νth values from the top and from the bottom of the sample (cf 28.6.17), have a larger variance than x̄.

It is instructive to consider various other functions of the sample values that might be used as unbiased estimates of m; it will be found that the variance is always at least equal to σ²/n. We give here a simple example of this kind. Consider a sample of n = 3 values from the normal distribution specified above, and let the sample values be arranged in order of magnitude: x₁ ≤ x₂ ≤ x₃. It might then be thought that the weighted mean z = c x₁ + (1 − 2c) x₂ + c x₃ would, for some conveniently chosen value of c, be a »better» estimate of m than the simple arithmetic mean, which corresponds to c = 1/3. We have, however,
It is instructive to consider various other functions of the sample values that might be used as unbiased estimates of ni\ it will be found that the variance is always at least equal to o^ln. We give here a simple example of this kind. Con- sider a sample of w = 3 values from the normal distribution specified above, and let the sample values be arranged in order of magnitude; r, *£ ar,. It might then be thought that the weighted mean z ™ c j'i -f (1 ~ 2 c] or* + c JTg would, for some conveniently chosen value of c, be a »better» estimate of m than the simple arithmetic mean, which corresponds to c = J. We have, however, E{z) = m and D' {z) = ~+ (2 - 3 VI' (c - if, 3 7t so that the variance of z attains its minimum precisely when c = J. It will be left as an exercise for the reader to prove this formula, and to verify that the con- ditions for a regular estimate are safisfied in this case. 483 32.3 Ex. 2. The variance of a normal population. Writing f{x\ O - 1 V 27r<;* e " 2> where a — is the parameter to be estimated, while tn is a known constant, we may choose for A my finite interval a < a* < b with a > 0, and obtain E 1 2(j* Consequently the variance of any regular unbiased estimate of a* is at least equal to 2 <jVw. Correcting the sample variance s* for bias (cf 27.6), we obtain the expression n 1 V (jT^- ~ .!■)*» which by (27.4.6) is an unbiased estimate of a* with the variance 2ff*/(n — 1). Obviously this is not an efficient estimate, but an estimate of efficiency (n — l)/n < 1. On the other hand, consider the estimate ~ S wi)V This is legitimate, since m is now a known constant. It is easily seen that has the mean a* and the variance 2 a*/n, and thus provides an efficient estimate of a*. Ex. 3. The s. d. of a normal population. If, in the distribution of £x. 2, we regard the s. d. c instead of the variance o’ as the parameter to be estimated, we find E I®.? 
j ~ — 00 Consequently the variance of any regular unbiased estimate of a is at least equal to <jV(2 w). Consider e. g. the expression where 8 is the s. d. of the sample. Jly (29.3.3) we have E{h) — a, and so that the efficiency c s') tends to 1 as « — w. For small n the efficiency is, how- ever, considerably smaller than I. Taking c. g. w — 2, we have c(s') — — - 0.4880, Z ' Tt — Z) while for 7^—3 we have eijf) = 7 ,vr— — : = O.6100. 6 (4 — 71 ) Similarly we find that the expression 484 32.3 ■■’Vi n') where «o is defined in Ex. 2, i.s an unbiased estimate of a, with variance y2/n/" + l\ J 2n \»iV The efficiency e («'„) lends to 1 ns « — oo . For n = 2 we have o,*V’ = — = 0.»151, 4 (4 — 7l) 4 while for w = 8 we have e («o) = — = 0.9858, considerably above the corre- «j 7t O) .^ponding figures for For the mean deviation «, = - | jr, — m |, we find by easy calculations SO that y 7tl2 is an unbiased estimate of a, with the efficiency ^ - = 0.8780. n — 2 Ex. 4. A non^regular case. When the fr. f. has discontinuity points, the posi- tion of which depends on the parameter, the conditions for a regular case are usually not satisfied. In such cases, it is often possible to find unbiased estimates of »ab- normally high» precision, i.e. such that the variance is smaller than the lower limit given by (32.3.3 a) for regular estimates. Consider e g. the fr. f. defined by f{x\ «) = for x «, and f{x\ a) = 0 for In the point X a the derivative ^ does not exist, so that this is a non- Oa X < «. regular case. be differentiated in the usual simple way; we have, in fact As we have seen in 7.3, the relation j f dx = 1 cannot in this case 1. When ^\e pass from (32.3.5 > to (32.3.61, all the n* terms in the first member will thus be equal to 1. Assuming that the functions g and h satisfy our conditions, we then obtain instead of vr 1/w, which w'ould follow from (32 3.3 a), only the weaker inequality ^ 1/a?*. For the particular estimate cc* = Min x^ — 1/n. 
where Min ./■, denotes the smallest of the sample values, we find the fr. f. n f[n a*; — 1 so that E(a*) ~=- tc, l^~(ci*)= 1/n*. Thus a* is an unbiased estimate, the variance of which is for all > 1 smaller than the limit given by (32 3.3 a;. ^ A further example of the same character is provided by the rectangular distri- bution, when we use the mean or the difference of the extreme values of the sainpie as estimates of the mean or the range of the population. According to (28.6 8j and (28.6.9), the variance is in both cases of the order w-2, and thus certainly falls hclow^ the limit given by (32.3.3 a', when n is large. 485 32.3 2. The discrete type. — Consider a discrete distribution with the mass points and the corresponding: probabilities Pi{a), Ps(a), . . where a may have any value in and the w/ are independent of a. This case is largely analogous to the previous case, and will be treated somewhat briefly. As in the previous case, we consider an estimate a* = a* Xn) with the mean E(a*) = « +• h(a). The probability that the sample point in Rn with the coordinates Xn assumes the particular position M determined by = w/,, . . Xn = ^Uf^ is equal to Pt,M • * point M may, however, also be determined by another set of « coordinates, viz. by the value assumed by a* in M, say at, and by 72 — 1 further coordinates Vi, . . y,i-i which determine the position of M on the hypersurface a* = a*. If Qv (a) denotes the probability that a* takes the value aj, while 1 1 (a) is the conditional probability of the set of values of Vj, . . ., corresponding to for a given v, we have the fol lowing relation which corresponds to (32.3.2): (32.3.8) pt, (a) . . . pi^ (a) = q. (a) n,, , ,.„_j | . (a). 
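The »abnormally high» precision claimed in Ex. 4 can be checked by simulation. A minimal sketch (hypothetical values, standard library only): with α* = min xᵢ − 1/n the simulated variance comes out near 1/n², well below the regular-case limit 1/n:

```python
import random

# Simulation of Ex. 4: f(x; a) = exp(-(x - a)) for x >= a.
# The estimate a* = min(x_i) - 1/n is unbiased with variance 1/n^2.
random.seed(3)
a, n, reps = 5.0, 10, 20000
ests = []
for _ in range(reps):
    sample = [a + random.expovariate(1.0) for _ in range(n)]
    ests.append(min(sample) - 1 / n)
mean = sum(ests) / reps
var = sum((e - mean) ** 2 for e in ests) / reps
print(round(mean, 3), round(var, 4), 1 / n ** 2)
```

The simulated mean should be close to α = 5 and the variance close to 1/n² = 0.01, i.e. far below the regular-case bound 1/n = 0.1.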
We now define a regular estimation case of the discrete type by the condition that, for every α in A, all the derivatives p'ᵢ(α), q'_ν(α) and π'_{y₁,…,y_{n−1}|ν}(α) exist and are such that the series which correspond to the analogous integrals considered in the continuous case converge absolutely and uniformly in A. We shall then also call α* a regular estimate of α. In any regular estimation case of the discrete type, we have the inequality corresponding to (32.3.3):

(32.3.9)\quad D^2(\alpha^*) \ge \frac{\left(1 + b'(\alpha)\right)^2}{n \displaystyle\sum_i p_i \left(\frac{d \log p_i}{d\alpha}\right)^2}.

The sign of equality holds here, for every α in A, when and only when the following two conditions are satisfied whenever q_ν(α) > 0:

A) The conditional probability π_{y₁,…,y_{n−1}|ν}(α) is independent of α.

B) We have ∂ log q_ν/∂α = k(α*_ν − α), where k is independent of ν but may depend on α.

In the particular case when α* is unbiased whatever be the value of α in A, we have b(α) = 0, and (32.3.9) reduces to

(32.3.9\ a)\quad D^2(\alpha^*) \ge \frac{1}{n \displaystyle\sum_i p_i \left(\frac{d \log p_i}{d\alpha}\right)^2}.

The proof of this theorem follows the same lines as the corresponding proof in the continuous case. We take the logarithmic derivatives on both sides of (32.3.8), square, multiply by (32.3.8), and then sum over all possible sample points M. By means of Lemma 2 of the preceding paragraph, the truth of the theorem then follows. As in the continuous case, an unbiased estimate will be called efficient, when the sign of equality holds in (32.3.9 a). The definition of the efficiency of an estimate, and the remarks concerning the correlation between various estimates, extend themselves with obvious modifications to the discrete case.

The expressions (32.3.3 a) and (32.3.9 a) are particular cases of the general inequality

(*)\quad D^2(\alpha^*) \ge \frac{1}{n \displaystyle\int_{-\infty}^{\infty} \left(\frac{\partial \log dF}{\partial \alpha}\right)^2 dF},

which holds, under certain conditions, even for a d. f. F(x; α) not belonging to one of the two simple types. The integral appearing here is of a type known as Hellinger's integral (cf e. g. Hobson, Ref. 17, I, p. 609). We shall not go into this matter here, but proceed to give some further examples of efficient estimates.

Ex. 5.
For the binomial distribution we have

p_r(\alpha) = \binom{N}{r} p^r q^{N-r},

where α = p is the parameter to be estimated, while N is a known integer, and q = 1 − p. Then

E\left(\frac{d \log p_r}{dp}\right)^2 = E\left(\frac{r}{p} - \frac{N-r}{q}\right)^2 = \frac{N}{pq}.

Thus the variance of any regular unbiased estimate p* from a sample of n values is at least equal to pq/(nN). For the particular estimate p* = x̄/N we have E(p*) = p and D²(p*) = pq/(nN), so that this is an efficient estimate.

Ex. 6. For the Poisson distribution with the parameter λ we have

p_r = \frac{\lambda^r}{r!}\, e^{-\lambda},

and E(d log p_r/dλ)² = E(r/λ − 1)² = 1/λ. Thus the variance of any regular unbiased estimate λ* is at least equal to λ/n. For the particular estimate λ* = x̄ we have E(λ*) = λ and D²(λ*) = λ/n, so that this is an efficient estimate.

32.4. Sufficient estimates. — In order that a regular unbiased estimate α* should be efficient, i. e. of minimum variance, it is necessary and sufficient that the conditions A) and B) of the preceding paragraph are both satisfied. If we only require that condition A) should be satisfied, we obtain a wider class of estimates. We now proceed to consider this class, restricting ourselves to distributions of the continuous type, the discrete case being perfectly analogous.

For the continuous case, condition A) requires that the conditional fr. f. h(ξ₁, …, ξ_{n−1} | α*; α) should be independent of α, whenever g(α*; α) > 0. This means that the distribution of mass in the infinitesimal domain bounded by two adjacent hypersurfaces α* and α* + dα* is independent of α. In such a case, the estimate α* may be said to summarize all the relevant information contained in the sample with respect to the parameter α. In fact, when we know the value of α* corresponding to our sample, say α₀*, the sample point M must lie on the hypersurface α* = α₀*, and the conditional distribution on this hypersurface is independent of α, so that the further specification of the position of M does not give any new information with respect to α. Using the terminology introduced by R. A. Fisher (Ref. 89, 96), we shall then call α* a sufficient estimate.
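Ex. 6 can likewise be verified by simulation. A short sketch (hypothetical λ and sample sizes; the Poisson sampler counts unit-rate exponential interarrivals, standard library only):

```python
import random

# Monte Carlo check of Ex. 6: the sample mean of Poisson(lam) data
# is an efficient estimate, with variance attaining the bound lam/n.
random.seed(4)

def poisson(lam):
    # count arrivals of a unit-rate Poisson process before time lam
    k, t = 0, random.expovariate(1.0)
    while t < lam:
        k += 1
        t += random.expovariate(1.0)
    return k

lam, n, reps = 3.0, 20, 20000
ests = [sum(poisson(lam) for _ in range(n)) / n for _ in range(reps)]
avg = sum(ests) / reps
var = sum((e - avg) ** 2 for e in ests) / reps
print(round(avg, 2), round(var, 3), lam / n)
```

The simulated variance should agree with λ/n = 0.15 to within Monte Carlo error.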
Since in (32.3.1) the Jacobian J is independent of α, it follows that α* is sufficient if and only if

(32.4.1)\quad f(x_1; \alpha) \cdots f(x_n; \alpha) = g(\alpha^*; \alpha)\, H(x_1, \ldots, x_n),

where H is independent of α. From the nature of the conditions A) and B), it is fairly evident that efficient or sufficient estimates can only be expected to exist for rather special classes of populations. There are important connections between these classes of estimates, when they exist, and the maximum likelihood method (cf 33.2). For further information concerning the conditions of existence and other properties of efficient and sufficient estimates, the reader is referred to papers by R. A. Fisher (Ref. 89, 96, 103, 104 etc.), Neyman (Ref. 162), Neyman and E. S. Pearson (Ref. 173), Koopman (Ref. 141), Darmois (Ref. 74), Dugué (Ref. 76) and others.

In Ex. 1, 2, 5 and 6 of the preceding paragraph, we have considered various examples of efficient estimates. All these are, a fortiori, sufficient estimates. In each case, this can be directly shown by studying the transformation which replaces the original sample variables by the estimate α* and n − 1 further conveniently chosen new variables, and verifying that condition A) is satisfied. The reader is recommended to carry out these transformations in detail. (Cf also the analogous case in 32.6, Ex. 1.) The estimate s₀' defined in 32.3, Ex. 3, is an example of a regular unbiased estimate satisfying condition A) but not condition B), i. e. a sufficient estimate which is not efficient. A further example of the same kind will be given in 33.3, Ex. 3. Thus the class of sufficient estimates is effectively more general than the class of efficient estimates.

The above definition of a sufficient estimate, which applies to the class of regular and unbiased estimates, may be directly extended to the class of all regular estimates, whether unbiased or not.
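The factorization criterion (32.4.1) can be made concrete in the discrete case. For a Poisson sample the sum is a sufficient statistic: the conditional probability of the sample given its sum is multinomial with equal cell probabilities 1/n, hence free of λ. A small sketch with hypothetical sample and parameter values:

```python
import math

# Conditional probability of a Poisson sample given its sum;
# sufficiency requires this to be the same for every lam.
def cond_prob(xs, lam):
    s = sum(xs)
    n = len(xs)
    joint = math.prod(lam ** x * math.exp(-lam) / math.factorial(x)
                      for x in xs)
    # the sum of n independent Poisson(lam) variables is Poisson(n*lam)
    marg = (n * lam) ** s * math.exp(-n * lam) / math.factorial(s)
    return joint / marg

xs = [2, 0, 3, 1]
p1, p2 = cond_prob(xs, 0.7), cond_prob(xs, 4.2)
print(round(p1, 6), round(p2, 6))
```

Both values equal the multinomial probability 6!/(2!·0!·3!·1!)·(1/4)⁶ = 60/4096, independently of λ.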
After this extension, it follows immediately from the definition that the property of sufficiency is invariant under a change of variable in the parameter. Thus if α* is a sufficient estimate of the parameter α, and if we replace α by a new parameter φ(α), then φ(α*) will be a sufficient estimate of φ(α). For efficient estimates, there is no corresponding proposition.

32.5. Asymptotically efficient estimates. — In the preceding paragraphs, we have considered the size n of the sample as a fixed integer. Let us now suppose that the regular unbiased estimate α* = α*(x₁, …, xₙ) is defined for all sufficiently large values of n, and let us consider the asymptotic behaviour of α* as n tends to infinity. If α* converges in probability to α as n tends to infinity, α* is a consistent estimate of α (cf 27.6). — In Chs 27–29, we have seen (cf e. g. 27.7 and 28.4) that in many important cases the s. d. of an estimate α* is of order n^{−1/2} for large n, so that we have D(α*) ∼ c n^{−1/2}, where c is a constant. If α* is unbiased and has a s. d. of this form, it is obvious that α* is consistent (cf 20.4). Further, in such a case the efficiency e(α*) defined by (32.3.7) tends to a definite limit as n tends to infinity:

(32.5.1)\quad \lim_{n\to\infty} e(\alpha^*) = e_0(\alpha^*) = \frac{1}{c^2\, E\left(\dfrac{\partial \log f}{\partial \alpha}\right)^2}.

In the discrete case we obtain an analogous expression. This limit is called the asymptotic efficiency of α*. Obviously 0 ≤ e₀(α*) ≤ 1.

Consider further the important case of an estimate α*, whether regular and unbiased or not, which for large n is asymptotically normal (α, c/√n). We have seen in 28.4 that this situation may arise even in cases when E(α*) and D(α*) do not exist. However, when n is large, the distribution of α* will then for practical purposes be equivalent to a normal distribution with the mean α and the s. d. c/√n, and accordingly we shall even in such cases denote the quantity e₀(α*) defined by the last member of (32.5.1) as the asymptotic efficiency of α*.
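As a concrete instance of these definitions (anticipating Ex. 1 below), one can simulate the sample median of a normal sample: n·D²(median) should approach πσ²/2, giving the asymptotic efficiency (σ²/n) : (πσ²/(2n)) = 2/π. A rough sketch with hypothetical sample sizes:

```python
import math
import random
import statistics

# n * Var(median) for normal samples should be near pi/2 (sigma = 1),
# so the median's asymptotic efficiency is 2/pi ~ 0.6366.
random.seed(7)
n, reps = 201, 4000
meds = [statistics.median(random.gauss(0.0, 1.0) for _ in range(n))
        for _ in range(reps)]
nvar = n * statistics.pvariance(meds)
print(round(nvar, 2), round(math.pi / 2, 2), round(2 / math.pi, 4))
```

For n = 201 the finite-sample value already lies close to the asymptotic constant π/2 ≈ 1.57.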
When e₀(α*) = 1, we shall call α* an asymptotically efficient estimate of α. Under fairly general conditions, an asymptotically efficient estimate can be found by the method of maximum likelihood (cf 33.3).

Ex. 1. For the normal distribution, the sample median may be used as an estimate of m, and by 28.6 this estimate has the asymptotic efficiency 2/π ≈ 0.6366. Thus if we estimate m by calculating the median from a sample of, say, 10 000 observations, we obtain an estimate of the same precision as could be obtained by calculating the mean x̄ from a sample of only 2n/π ≈ 6366 observations. Nevertheless, the median is sometimes preferable in practice, on account of the greater simplicity of its calculation. We may also use the arithmetic mean of the ν:th values from the top and from the bottom of the sample as an estimate of m. By (28.6.17), this is an estimate of asymptotic efficiency zero.

When, in the normal distribution, m is known, and it is required to estimate the variance σ² or the s. d. σ, we may use various estimates connected with the sample variance s². In Ex. 2–3 of 32.3, we have already met with some examples of asymptotically efficient estimates of this kind. — We may also use the difference between the ν:th values from the top and from the bottom of the sample, multiplied by an appropriate constant, as an estimate of σ. According to (28.6.18), this is an estimate of asymptotic efficiency zero. The use of this estimate in large samples would thus involve a »loss of information» even greater than in the case of the sample median mentioned above. Nevertheless, the estimates of σ as well as of m based on the ν:th values may often be used in practice with great advantage, as their calculation is very simple, and the loss of information is not considerable for small values of n (cf the papers quoted in this connection in 28.6).

Ex. 2. For the Cauchy distribution with the fr. f.

f(x; \mu) = \frac{1}{\pi}\,\frac{1}{1 + (x - \mu)^2}

we have

E\left(\frac{\partial \log f}{\partial \mu}\right)^2 = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{4(x-\mu)^2}{\left(1 + (x-\mu)^2\right)^3}\, dx = \frac{1}{2}.

Thus the variance of any regular unbiased estimate of μ is at least equal to 2/n. By 19.2, the sample mean x̄ has the same fr. f. f(x; μ), so that the mean is not a consistent estimate of μ. Neither is the arithmetic mean of the ν:th values from the top and from the bottom of the sample (cf 28.6.11). On the other hand, the sample median is by 28.6 asymptotically normal (μ, π/(2√n)), and thus the median has the asymptotic efficiency (2/n) : (π²/(4n)) = 8/π² ≈ 0.8106.

32.6. The case of two unknown parameters. — We shall now briefly indicate how the concepts and propositions given in the preceding paragraphs may be generalized to cases involving several unknown parameters. It will be sufficient to give the explicit statements of the results for continuous distributions, as the corresponding results for the discrete case follow by analogy. In order to simplify the writing, we shall further restrict ourselves to the case of unbiased estimates.

In the present paragraph we shall consider a distribution with two unknown parameters α and β, specified by a fr. f. f(x; α, β). From a sample of n values x₁, …, xₙ drawn from this distribution, we form two functions α* = α*(x₁, …, xₙ) and β* = β*(x₁, …, xₙ), which are assumed to be unbiased estimates of α and β respectively. We then consider a transformation in the sample space Rₙ, replacing the old variables x₁, …, xₙ by n new variables α*, β* and ξ₁, …, ξ_{n−2}. For this transformation we have the following relations corresponding to (32.3.1) and (32.3.2):

\prod_{i=1}^n f(x_i; \alpha, \beta)\, dx_1 \cdots dx_n = g(\alpha^*, \beta^*; \alpha, \beta)\, h(\xi_1, \ldots, \xi_{n-2} \mid \alpha^*, \beta^*; \alpha, \beta)\, d\alpha^*\, d\beta^*\, d\xi_1 \cdots d\xi_{n-2}.

Here g is the joint fr. f. of α* and β*, while h is the conditional fr. f. of ξ₁, …, ξ_{n−2} for given values of α* and β*. Finally J is a Jacobian independent of α and β.

A regular estimation case is now defined as a case where the fr. f:s f, g and h satisfy the regularity conditions stated in 32.3 with respect to both parameters α and β. Operating in the same way as in 32.3, though dealing with total differentials with respect to α and β instead of partial derivatives with respect to α, we obtain (cf Dugué, Ref. 76)

(32.6.1)\quad \int\!\!\int \left(\frac{\partial \log g}{\partial \alpha}\,d\alpha + \frac{\partial \log g}{\partial \beta}\,d\beta\right)^2 g \; d\alpha^*\, d\beta^* \le n \int_{-\infty}^{\infty} \left(\frac{\partial \log f}{\partial \alpha}\,d\alpha + \frac{\partial \log f}{\partial \beta}\,d\beta\right)^2 f\, dx,

where the sign of equality holds when and only when the conditional fr. f. h is independent of α and β, whenever g > 0. In a case where this condition is satisfied, the estimates α* and β* may be said to summarize all relevant information contained in the sample with respect to α and β. In generalization of 32.4, we shall then say that α* and β* are joint sufficient estimates of α and β.

Both members of (32.6.1) are quadratic forms in dα and dβ. Owing to the homogeneity, the same inequality between the forms holds true even if dα and dβ are replaced by any variables u and v, and thus (32.6.1) may be written with u and v in the place of dα and dβ:

(32.6.2)

Consider now the inequality (32.2.1), which expresses the main result of Lemma 1 in 32.2, and suppose that ψ(α) = α. The inequality (32.2.1) may then be written as an inequality between two quadratic forms in one variable:

u^2 \int \left(\frac{\partial \log g}{\partial \alpha}\right)^2 g\, d\alpha^* \ge \frac{u^2}{E(\alpha^* - \alpha)^2},

where g = g(α*; α) is a fr. f. with the mean E(α*) = α, and the form in the second member is the reciprocal of the form E(α* − α)²u². When expressed in this way, the lemma may be generalized to fr. f:s involving several parameters (cf Cramér, Ref. 72; the detailed proof of this generalization will not be given here). In the case of two parameters, the generalized lemma asserts that the second member of (32.6.2) is at least equal to the reciprocal form of

E(\alpha^*-\alpha)^2 u^2 + 2E\left[(\alpha^*-\alpha)(\beta^*-\beta)\right] uv + E(\beta^*-\beta)^2 v^2 = \sigma_1^2 u^2 + 2\varrho\sigma_1\sigma_2\, uv + \sigma_2^2 v^2,

where σ₁, σ₂ and ϱ denote the s. d:s and the correlation coefficient of α* and β*, so that

(32.6.3)\quad \frac{1}{1-\varrho^2}\left(\frac{u^2}{\sigma_1^2} - \frac{2\varrho\, uv}{\sigma_1\sigma_2} + \frac{v^2}{\sigma_2^2}\right) \le n\, E\left(u\,\frac{\partial \log f}{\partial \alpha} + v\,\frac{\partial \log f}{\partial \beta}\right)^2.

Now the concentration ellipse of the joint distribution of α* and β* has the equation (cf 21.10.1)

(32.6.4)\quad \frac{1}{1-\varrho^2}\left(\frac{(u-\alpha)^2}{\sigma_1^2} - \frac{2\varrho(u-\alpha)(v-\beta)}{\sigma_1\sigma_2} + \frac{(v-\beta)^2}{\sigma_2^2}\right) = 4.

The inequalities (32.6.2) and (32.6.3) thus imply that the fixed ellipse

(32.6.5)\quad n\left[E\left(\frac{\partial \log f}{\partial \alpha}\right)^2 (u-\alpha)^2 + 2E\left(\frac{\partial \log f}{\partial \alpha}\frac{\partial \log f}{\partial \beta}\right)(u-\alpha)(v-\beta) + E\left(\frac{\partial \log f}{\partial \beta}\right)^2 (v-\beta)^2\right] = 4

lies wholly within the concentration ellipse of any pair of regular unbiased estimates α*, β*. — This is the generalization to two parameters of the inequality (32.3.3 a).

When the sign of equality holds in both relations (32.6.2) and (32.6.3), we shall say that α* and β* are joint efficient estimates of α and β. In this case the two ellipses (32.6.4) and (32.6.5) coincide, and the joint distribution of α* and β* has a greater concentration (cf 21.10) than the distribution of any non-efficient pair of estimates.

Consider now a pair of joint efficient estimates α₁* and β₁*. The variances of α₁* and β₁*, and the correlation coefficient between these two estimates, are obtained by forming the reciprocal of the quadratic form in the first member of (32.6.5). Hence we obtain e. g.

D^2(\alpha_1^*) = \frac{1}{1-\varrho^2(\alpha_1^*, \beta_1^*)}\cdot\frac{1}{n\, E\left(\dfrac{\partial \log f}{\partial \alpha}\right)^2}.

As soon as E(∂ log f/∂α · ∂ log f/∂β) ≠ 0, the variance of α₁* is thus greater than the variance of an efficient estimate in the case when α is the only unknown parameter (cf 32.3.3 a). Now, in a case when there are two unknown parameters it often happens that we are only interested in estimating one of the parameters, say α, and we may then ask if it would be possible to find some other pair of regular unbiased estimates α**, β**, yielding a variance D²(α**) < D²(α₁*), no matter how large the corresponding D²(β**) becomes. However, since the ellipse (32.6.5) lies wholly within the ellipse (32.6.4), the maximum value of the abscissa for all points of the former ellipse is at most equal to the corresponding maximum for the latter ellipse.
Hence we obtain by some calculation the inequality

D^2(\alpha^{**}) \ge \frac{1}{1-\varrho^2(\alpha_1^*, \beta_1^*)}\cdot\frac{1}{n\, E\left(\dfrac{\partial \log f}{\partial \alpha}\right)^2} = D^2(\alpha_1^*),

which shows that it is not possible to find a »better» estimate of α than α₁*.

The ratio between the two-dimensional variance (cf 22.7) of a pair of joint efficient estimates α₁*, β₁*, and the corresponding quantity for any pair of regular unbiased estimates α*, β*, will be called the joint efficiency of α* and β*, and denoted by e(α*, β*). This is identical with the square of the ratio between the areas of the ellipses (32.6.5) and (32.6.4), which by (11.12.3) is

e(\alpha^*, \beta^*) = \frac{1}{n^2 (1-\varrho^2)\, \sigma_1^2 \sigma_2^2 \left[E\left(\dfrac{\partial \log f}{\partial \alpha}\right)^2 E\left(\dfrac{\partial \log f}{\partial \beta}\right)^2 - E^2\left(\dfrac{\partial \log f}{\partial \alpha}\dfrac{\partial \log f}{\partial \beta}\right)\right]}.

The concepts of asymptotic efficiency and asymptotically efficient estimate (cf 32.5) directly extend themselves to the present case. As in 32.3, all the above results remain true in the case when we consider samples from a multidimensional population, specified by a fr. f. containing two unknown parameters.

Ex. 1. When both parameters α = m and β = σ² of a normal distribution are unknown, we have (cf 32.3, Ex. 1–2)

E\left(\frac{\partial \log f}{\partial m}\right)^2 = \frac{1}{\sigma^2}, \quad E\left(\frac{\partial \log f}{\partial m}\,\frac{\partial \log f}{\partial \sigma^2}\right) = 0, \quad E\left(\frac{\partial \log f}{\partial \sigma^2}\right)^2 = \frac{1}{2\sigma^4},

so that in this case the optimum ellipse (32.6.5) becomes

n\left(\frac{(u-m)^2}{\sigma^2} + \frac{(v-\sigma^2)^2}{2\sigma^4}\right) = 4.

Consequently this fixed ellipse lies within the concentration ellipse of the joint distribution of any pair of regular unbiased estimates of m and σ². For the particular pair of estimates α* = x̄ and β* = n s²/(n − 1), the relation (29.3.6) shows the transformation which replaces the sample variables x₁, …, xₙ by the new variables x̄, s and z₁, …, z_{n−2}. The last factor in the expression of the fr. f. of the new variables represents the conditional fr. f. of z₁, …, z_{n−2}, and this is independent of the unknown parameters m and σ (and, in fact, also of x̄ and s, but this is of no importance for our present purpose). Hence it follows that x̄ and n s²/(n − 1) are joint sufficient estimates of m and σ². Further, we have

D^2(\bar{x}) = \frac{\sigma^2}{n}, \quad D^2\!\left(\frac{n}{n-1}\, s^2\right) = \frac{2\sigma^4}{n-1}, \quad \varrho\!\left(\bar{x},\, \frac{n}{n-1}\, s^2\right) = 0.

Thus the concentration ellipse of x̄ and n s²/(n − 1) has the equation

n\,\frac{(u-m)^2}{\sigma^2} + (n-1)\,\frac{(v-\sigma^2)^2}{2\sigma^4} = 4.

The square of the ratio between the areas of the two ellipses gives the value (n − 1)/n for the joint efficiency of the estimates. When n → ∞, the efficiency tends to unity, and thus x̄ and n s²/(n − 1) are asymptotically efficient estimates of m and σ². The same holds, of course, also for x̄ and s², though s² is not unbiased.

Ex. 2. Consider a two-dimensional normal fr. f. (21.12.1) with known values of σ₁, σ₂ and ϱ, while α = m₁ and β = m₂ are the two unknown parameters. From a sample of n pairs of values (x₁, y₁), …, (xₙ, yₙ) we form the estimates α* = x̄ and β* = ȳ. It is then easily shown that in this case the concentration ellipse of the estimates x̄ and ȳ coincides with the fixed ellipse (32.6.5), each having the equation

\frac{n}{1-\varrho^2}\left(\frac{(u-m_1)^2}{\sigma_1^2} - \frac{2\varrho(u-m_1)(v-m_2)}{\sigma_1\sigma_2} + \frac{(v-m_2)^2}{\sigma_2^2}\right) = 4.

Thus x̄ and ȳ are joint efficient estimates, and a fortiori joint sufficient estimates, of m₁ and m₂.

32.7. Several unknown parameters. — The results of the preceding paragraphs may be generalized to distributions involving any number of unknown parameters. If α₁*, …, α_k* are any regular unbiased estimates of the unknown parameters α₁, …, α_k, it is shown in a similar way as in the case k = 2 that the fixed k-dimensional ellipsoid

(32.7.1)\quad n \sum_{i,j=1}^{k} E\left(\frac{\partial \log f}{\partial \alpha_i}\,\frac{\partial \log f}{\partial \alpha_j}\right)(u_i-\alpha_i)(u_j-\alpha_j) = k+2

lies wholly within the concentration ellipsoid (cf 22.7) of the joint distribution of α₁*, …, α_k*. In the limiting case when the two ellipsoids coincide, we shall say that α₁*, …, α_k* are joint efficient estimates of α₁, …, α_k. Thus the distribution of a set of joint efficient estimates has a greater concentration (cf 22.7) than the distribution of any set of non-efficient estimates. The moment matrix of a set of joint efficient estimates is the reciprocal of the matrix of the quadratic form in the first member of (32.7.1), as shown in the preceding paragraph for the case of two parameters.
— The concepts of sufficiency, efficiency, etc. are introduced in the same way as in the case k = 2.

As an example, we consider a two-dimensional normal fr. f. with the five unknown parameters m₁, m₂, σ₁, σ₂ and ϱ. From a sample of n pairs of values (x₁, y₁), …, (xₙ, yₙ), we obtain the unbiased estimates x̄, ȳ, n m₂₀/(n − 1), n m₀₂/(n − 1) and n m₁₁/(n − 1) for the five parameters (cf 29.6). The moment matrix of the joint distribution of the five estimates can be calculated e. g. by means of the expression of the joint c. f. of the estimates given in 29.6. Further, the coefficients in the equation (32.7.1) of the optimum ellipsoid may be found by introducing the expression of the fr. f. into (32.7.1) and performing the integrations. By simple, though somewhat tedious, calculations it will be found that the joint efficiency of the five estimates tends to unity as n → ∞, so that the estimates are asymptotically efficient.

32.8. Generalization. — Throughout the present chapter, we have been concerned with the problem of estimating certain parameters from a set of values, obtained by independent drawings from a fixed distribution. However, our methods are applicable under more general conditions. Consider e. g. the following problem: The variables x₁, …, xₙ have a joint distribution in Rₙ, with the fr. f. f(x₁, …, xₙ; α) of known mathematical form, containing the unknown parameter α. An observed point x = (x₁, …, xₙ) is known, and it is required to find the »best possible» estimate α* = α*(x₁, …, xₙ) of α by means of the observed coordinates xᵢ.

In the particular case when the joint fr. f. is of the form f(x₁; α) ⋯ f(xₙ; α), this reduces to the problem treated in 32.3, where the xᵢ are independent variables having the same distribution. The general set-up covers e. g. also the cases when the xᵢ are correlated, or when they consist of several independent samples from different distributions.

Even in the general case, we talk of the point x = (x₁, …, xₙ) as a sample point, which is represented in the sample space Rₙ. We now consider the same transformation of variables in the sample space as in (32.3.1) and (32.3.2). In the present case, however, we have to introduce the general form of the joint fr. f. into the formulae expressing the transformation, so that e. g. (32.3.2) becomes

f(x_1, \ldots, x_n; \alpha)\, dx_1 \cdots dx_n = g(\alpha^*; \alpha)\, h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*; \alpha)\, d\alpha^*\, d\xi_1 \cdots d\xi_{n-1}.

The whole argument of 32.3–32.5 (continuous case) now applies almost without modification, and in this way the concepts of unbiased, efficient and sufficient estimates etc. are extended to the present general case. Thus e. g. the generalized form of the inequality (32.3.3 a) for the variance of an unbiased estimate is

D^2(\alpha^*) \ge \frac{1}{\displaystyle\int\cdots\int \left(\frac{\partial \log f}{\partial \alpha}\right)^2 f\, dx_1 \cdots dx_n},

and when the sign of equality holds here, we call α* an efficient estimate. When the conditional fr. f. h is independent of α, we call α* a sufficient estimate, etc. The same generalization may evidently be applied to cases of discrete distributions, and to distributions containing several unknown parameters.

CHAPTER 33.

Methods of Estimation.

33.1. The method of moments. — We now proceed to discuss some general methods of forming estimates of the parameters of a distribution by means of a set of sample values. The oldest general method proposed for this purpose is the method of moments, introduced by K. Pearson (Ref. 180, 182, 184 and other works), and extensively used by him and his school. This method consists in equating a convenient number of the sample moments to the corresponding moments of the distribution, which are functions of the unknown parameters. By considering as many moments as there are parameters to be estimated, and solving the resulting equations with respect to the parameters, estimates of the latter are obtained. This method often leads to comparatively simple calculations in practice.
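As a worked instance of this recipe (not one of the book's own examples): for the gamma distribution with shape k and scale t, the first two moments give kt for the mean and kt² for the variance, so equating sample moments yields k* = x̄²/s² and t* = s²/x̄. A sketch with hypothetical true values:

```python
import random

# Method of moments for the two-parameter gamma family:
# mean = k*t and variance = k*t^2, hence k* = xbar^2/s2, t* = s2/xbar.
random.seed(6)
k, t, n = 2.0, 1.5, 50000
xs = [random.gammavariate(k, t) for _ in range(n)]
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n
k_hat, t_hat = xbar ** 2 / s2, s2 / xbar
print(round(k_hat, 2), round(t_hat, 2))
```

With 50 000 observations the estimates should reproduce the true values k = 2 and t = 1.5 to within a few percent.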
The estimates obtained in this way from a set of n sample values are functions of the sample moments, and certain properties of their sampling distributions may be inferred from Chs 27–28. Thus we have seen (cf in particular 27.7 and 28.4) that, under fairly general conditions, the distribution of an estimate of this kind will be asymptotically normal for large n, and that the mean of the estimate will differ from the true value of the parameter by a quantity of order 1/n, while the s. d. will be asymptotically of the form c/√n. By a simple correction, we may often remove the bias of such an estimate, and thus obtain an unbiased estimate (cf 27.6). Under general conditions, the method of moments will thus yield estimates such that the asymptotic efficiency defined in 32.5 (or the corresponding quantity in the case of several parameters) exists. As pointed out by R. A. Fisher (Ref. 89), this quantity is, however, often considerably less than 1, which implies that the estimates given by the method of moments are not the »best» possible from the efficiency point of view, i. e. they do not have the smallest possible variance in large samples. Nevertheless, on account of its practical expediency the method will often render good service. Sometimes the estimates given by the method of moments may be used as first approximations, from which further estimates of higher efficiency may be determined by means of other methods.

In the particular case of the normal distribution, the method of moments gives the estimates x̄ and s² for the unknown parameters m and σ². Correcting for bias, we obtain the unbiased and asymptotically efficient (cf 32.6, Ex. 1) estimates x̄ and n s²/(n − 1). It was shown by Fisher (Ref. 89) that, in this respect, the normal distribution is exceptional among the distributions belonging to the Pearson system (cf 19.4), the asymptotic efficiency in other cases being as a rule less than 1. Some examples will be given in 33.8.
33.2. The method of maximum likelihood. — From a theoretical point of view, the most important general method of estimation so far known is the method of maximum likelihood. In particular cases, this method was already used by Gauss (Ref. 16); as a general method of estimation it was first introduced by R. A. Fisher in a short paper (Ref. 87) of 1912, and has afterwards been further developed in a series of works (Ref. 89, 96, 103, 104 etc.) by the same author. Important contributions have also been made by others, and we refer in this connection particularly to Dugué (Ref. 76).

Using the notations of 32.3, we define the likelihood function L of a sample of n values from a population of the continuous type by the relation

(33.2.1\ a)\quad L(x_1, \ldots, x_n; \alpha) = f(x_1; \alpha) \cdots f(x_n; \alpha),

while in the discrete case we write

(33.2.1\ b)\quad L(x_1, \ldots, x_n; \alpha) = p_{i_1}(\alpha) \cdots p_{i_n}(\alpha).

When the sample values are given, the likelihood function L becomes a function of the single variable α. The method of maximum likelihood now consists in choosing, as an estimate of the unknown population value of α, the particular value that renders L as great as possible. Since log L attains its maximum for the same value of α as L, we thus have to solve the likelihood equation

(33.2.2)\quad \frac{\partial \log L}{\partial \alpha} = 0

with respect to α. Let us agree to disregard any root of the form α = const., thus counting as a solution only a root which effectively depends on the sample values x₁, …, xₙ. Any solution of the likelihood equation will then be called a maximum likelihood estimate of α. In the present paragraph, we shall consider some properties of the maximum likelihood method for samples of a fixed size n, while in the next paragraph the asymptotic behaviour of maximum likelihood estimates for large values of n will be investigated.
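A minimal numerical sketch of solving (33.2.2) (hypothetical data; for a normal population with σ = 1 the root must coincide with the sample mean, which makes the answer easy to check):

```python
# Solve the likelihood equation d(log L)/da = 0 by bisection on the
# score function. For f = N(m, 1) the score is sum(x_i - m), whose
# root is the sample mean.
xs = [1.2, -0.4, 2.1, 0.7, 1.6]

def score(m):
    return sum(x - m for x in xs)

lo, hi = -10.0, 10.0          # score(lo) > 0 > score(hi)
for _ in range(60):
    mid = (lo + hi) / 2
    if score(lo) * score(mid) <= 0:
        hi = mid
    else:
        lo = mid
mle = (lo + hi) / 2
print(round(mle, 6), round(sum(xs) / len(xs), 6))
```

The same bisection (or Newton) recipe applies whenever the score is monotone in the parameter; only the `score` function changes.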
— The importance of the method is clearly shown by the two following propositions:

If an efficient estimate α* of α exists, the likelihood equation will have a unique solution equal to α*.

If a sufficient estimate α* of α exists, any solution of the likelihood equation will be a function of α*.

It will be sufficient to prove these propositions for the continuous case, the modifications required for the discrete case being obvious. When an efficient estimate α* exists, the conditions A) and B) stated in connection with (32.3.3 a) are satisfied, and thus by (32.3.5) we have

\frac{\partial \log L}{\partial \alpha} = \sum_{i=1}^{n} \frac{\partial \log f(x_i; \alpha)}{\partial \alpha} = k(\alpha^* - \alpha),

where k is independent of the sample values, but may depend on α. According to our convention with respect to the solutions of the likelihood equation (33.2.2), this equation will thus have the unique solution α = α*.
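The first proposition is easy to check concretely for the normal case of 32.3, Ex. 1 (hypothetical data below): the score is exactly of the form k(α* − α), so the likelihood equation has the sample mean as its unique root.

```python
# For N(m, sigma^2) with sigma known, the score is
# d(log L)/dm = sum((x_i - m)/sigma^2) = (n/sigma^2) * (xbar - m),
# i.e. k * (alpha* - alpha) with k = n/sigma^2.
sigma = 2.0
xs = [0.5, 1.7, -0.3, 2.4, 1.1, 0.6]
n = len(xs)
xbar = sum(xs) / n

def score(m):
    return sum((x - m) / sigma ** 2 for x in xs)

k = n / sigma ** 2
# the identity score(m) == k * (xbar - m) holds for every m
checks = [abs(score(m) - k * (xbar - m)) for m in (-1.0, 0.0, 2.5)]
print(round(xbar, 4), round(max(checks), 12))
```

The identity holds for every m (up to floating-point rounding), and the score vanishes only at m = x̄.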
Xn\ «) as large as possible for given values of the Xt. Some examples will be given in the next paragraph. 33.3. Asymptotic properties of maximum likelihood estimates. — We now proceed to investigate the asymptotic behaviour of maximum likelihood estimates for large values of //. We first consider the case of a single unknown parameter a. Jt will hr shofoi that, under certain general coiiditions, the likelihood equation (33.2.2) ha^s a solution uhich converges in probability to the true value of cf, as n -*■ oo. This solution is an asyinplotically normal and asymptotically efficient estimafe of a. As before, it will be sufficient to give the proof for the case of a continuous distribution, specified by the fr. f. f(x\ a). We shall use a method of proof indicated by Dugue (Ref. 70). — Suppose that the following conditions are satisfied: 1) For almost all .r, the derivatives d log /■ d^ log./' da d a^ exist for every a belonging to a non degenerate interval A. and log ./' Of® 500 33.3 2) For every a in A, we have < -jT i < -^2 W and I rVa® |"^^(^)) functions 1^\ and beinfj integrable over oo (— oo, oo), while j H(r)f{x; a)dx < M, where M is independent of a. — OO oo 3) For every a in the integral /HTO’ fdx is finite and — oo ])Ositive. We now denote by or^ the unknown true value of the parameter a in the distribution from which we are sampling, and we suppose that ccq is an inner point of A. We shall then first show that the likelihood e(|uation (33.2.2) has a solution which converges in probability to a^. — For every rt in A we have, indicating by the subscript 0 that a should be put equal to 0 log f Id log A , , X log A I 1 /j/ \2 IT f \ where |6^| < 1. Thus the likelihood equation (33.2.2) may, after multi- plication by 1/w, be written in the form (33.3. 
1) ’ “^ = 50 + Ji, (u - rg -r I OBAa- «„)* = 0, 11 (f f( where, writing f, in the place of /(ocr, «), (33.3.2) 1 The I^v are functions of the random variables Xn, and we now have to show that, with a probability tending to 1 as w oo^ the ecjuation (33.3.1) has a root a between the limits «o i however small the positive quantity 3 is chosen. Let us consider the behaviour of the /?r for large values of v. From the conditions 1) and 2) it follows (cf 32.3.4) that 501 33.3 for every u in A, and hence we obtain (33.3 3) /(.r; ao)dx = 0 E log .A _ I Ou^ /„ j \fOn^ f[x-, a„)ux — oo where by condition 3) we have k > 0. Thus by (33.3.2) Bq is the arithmetic mean of n independent random variables, all havin^j the same distribution with the mean value zero. By Khintchine’s theorem 20.5, it follows that Bq converges in probability to zero. In the same way we find that B^ converges in probability to — P, while con- verges in probability to the non-negative value E H(x) < 31. Let now <5 and £ be given arbitrarily small positive numbers, and let P(S) denote the joint pr. f. of the random variables X|, . . . Xn. For all sufficiently large w, say for all w > Wo = ?/(, (d, £), we then have t\ = P(\BQ\^^d^)<l£. I\=^P{B,'^-ik^)<l£. P, = P(\B,\^2 3I)<\€. Let further S denote the set of all points x ■— {x\, . . jr’„) such that all three inequalities |Z?ol<^^ B,<-ik\ \B,\<23{, are satisfied. The complementary^ set S* consists of all points x such that at least one of these three inequalities is vot satisfied, and thus we have by (6.2.2) P(S*) ^ P^ + Pg -1- Pj, < £, and hence P(S) > 1 — £. Thus the probability that the point x belongs to the set S', which is identical with the P-measure of S', is >1 — €, as soon as « > 77q (d, f). 502 33.3 For cf — ofo i second member of (33.3.1) assumes the values ± Jii^ ^ In every point x belongfing to S', the sum of the first and third terms of this expression is smaller in absolute value than [M + 1) d*, while we have d < — ^ A;* d. 
If d < 4 + 1), the sign of the whole expression will thus for « = Gq ± d be determined by the second terra, so that we have 0 log L Oa > 0 for (c — Qq — d, and 0 log L 0 (I < 0 for + d. Further, by condition 1) the function () log L for almost all x =^(.rj, . . Xr) a continuous function of in ^1. Thus for arbitrarily small d and b the likelihood equation will, with a probability exceeding 1 — £, have a root between the limits + d as soon as (d, f), and consequently the first part of the proof is completed. Next, let ct^ ^ u* . . ., Xw) be the solution of the likelihood e(j nation, the existence of which has just been established. From (33.3.1) and (33.3.2) we obtain 1 y 10 log /A kV n ^ \ 0 It J Q (33.3.4) 1 1 n (a* - «„) (Co)!k- ' It follows from the above that the denominator of the fraction in the second member converges in probability to 1. Further, by (33.3.3) is a variable with the mean zero and the s. d. k. By the 0 Lindeberg-Levy theorem (cf 17.4), the sum J asymptotically normal (0, k >v), and consequently the numerator in the second member of (33.3.4) is asymptotically normal (0, 1). Finally, it now follows from the convergence theorem of 20.6 that k V n (or* — cfo) is asymptotically normal (0, 1), so that or* is asymptotic- ally normal («o, c/k^), where 1/c* = /c“ = £ ^ • By (32.5.1) the asymptotic efficiency of or* is then 503 33.3 and thus our theorem is proved. The corresponding theorem for a discrete distribution is proved in the same way. In the case of several unknown parameters, we have to introduce conditions which form a straightforward generalization of the condi- tions l) — 3). It is then proved in the same way as above, using the multi dimensional form of the Lindeberg-Levy theorem (ef 21.11 and 24.7), that the likelihood equations have a system of solutions which are asymptotically normal and joint asymptotically efficient estimates of the parameters. Ex. 1. 
For a sample of $n$ values from a normal distribution with the unknown parameters $m$ and $\sigma$, the logarithm of the likelihood function is

$\log L = -\dfrac{1}{2\sigma^2}\sum (x_i - m)^2 - \tfrac{1}{2}n\log \sigma^2 - \tfrac{1}{2}n\log 2\pi,$

and the maximum likelihood method gives the equations

$\dfrac{\partial \log L}{\partial m} = \dfrac{\sum (x_i - m)}{\sigma^2} = 0, \qquad \dfrac{\partial \log L}{\partial \sigma^2} = \dfrac{\sum (x_i - m)^2}{2\sigma^4} - \dfrac{n}{2\sigma^2} = 0.$

Hence we obtain the maximum likelihood estimates

$m^* = \dfrac{1}{n}\sum x_i = \bar{x}, \qquad \left(\sigma^2\right)^* = \dfrac{1}{n}\sum (x_i - \bar{x})^2 = s^2,$

which coincide with the estimates given by the method of moments. We have already seen (cf 28.4 and 32.6, Ex. 1) that these estimates are asymptotically normal and asymptotically efficient.

Ex. 2. Consider the type III distribution (cf 19.4)

$f(x; \lambda) = \dfrac{x^{\lambda - 1} e^{-x}}{\Gamma(\lambda)} \qquad (x > 0,\ \lambda > 0)$

with the unknown parameter $\lambda$. For any finite interval $a \le \lambda \le b$ with $a > 0$ we may apply (32.3.3 a), and thus find that the lower limit of the variance of a regular unbiased estimate of $\lambda$ from a sample of $n$ values is (cf 12.3)

$\dfrac{1}{n\,E\left(\dfrac{\partial \log f}{\partial \lambda}\right)^2} = \dfrac{1}{n\,\dfrac{d^2 \log \Gamma(\lambda)}{d\lambda^2}}.$

In order to estimate $\lambda$ by the method of moments, we equate the sample mean $\bar{x}$ to the first moment $\lambda$ of the distribution, and thus obtain the estimate $\lambda^* = \bar{x}$. We then easily find $E(\lambda^*) = \lambda$, $D^2(\lambda^*) = \lambda/n$. Hence it follows by (32.3.7) and (12.5.4) that the efficiency of $\lambda^*$ is independent of $n$ and has the value

$e(\lambda^*) = \dfrac{1}{\lambda\,\dfrac{d^2 \log \Gamma(\lambda)}{d\lambda^2}}.$

This is always less than 1, and tends to zero as $\lambda \to 0$. — On the other hand, the method of maximum likelihood leads to the equation

$\dfrac{1}{n}\dfrac{\partial \log L}{\partial \lambda} = \dfrac{1}{n}\sum \log x_i - \dfrac{d \log \Gamma(\lambda)}{d\lambda} = 0,$

and the maximum likelihood estimate is the unique positive root $\lambda = \lambda^{**}$ of this equation. According to the general theorem proved above, $\lambda^{**}$ is asymptotically normal $\left(\lambda, \dfrac{1}{\sqrt{n\,\dfrac{d^2\log\Gamma(\lambda)}{d\lambda^2}}}\right)$, so that the asymptotic efficiency of $\lambda^{**}$ is equal to 1. This can also without difficulty be seen directly, since the variable $\log x$ has the mean $\dfrac{d\log\Gamma(\lambda)}{d\lambda}$ and the variance $\dfrac{d^2\log\Gamma(\lambda)}{d\lambda^2}$, and thus (cf 17.4) by the Lindeberg-Lévy theorem $\dfrac{1}{n}\sum \log x_i$ is asymptotically normal $\left(\dfrac{d\log\Gamma(\lambda)}{d\lambda}, \sqrt{\dfrac{1}{n}\dfrac{d^2\log\Gamma(\lambda)}{d\lambda^2}}\right)$.

Ex. 3.
In the type III distribution

$f(x; \alpha) = \dfrac{\alpha^{\lambda}}{\Gamma(\lambda)}\,x^{\lambda - 1} e^{-\alpha x} \qquad (x > 0,\ \alpha > 0)$

we now consider $\lambda$ as a given positive constant, while $\alpha$ is the unknown parameter. We then have

$\dfrac{1}{n}\dfrac{\partial \log L}{\partial \alpha} = \dfrac{\lambda}{\alpha} - \bar{x}.$

In this case, the method of moments and the method of maximum likelihood give the same estimate $\lambda/\bar{x}$ for $\alpha$. Correcting for bias, we obtain the unbiased estimate

$\alpha^* = \dfrac{n\lambda - 1}{n\bar{x}} = \dfrac{n\lambda - 1}{\sum x_i},$

which has the fr. f.

$g(\alpha^*; \alpha) = \dfrac{\alpha^{n\lambda}(n\lambda - 1)^{n\lambda}}{\Gamma(n\lambda)\,(\alpha^*)^{n\lambda + 1}}\,e^{-\alpha(n\lambda - 1)/\alpha^*},$

as is found without difficulty, e.g. by means of the c. f. (12.3.4). Supposing $n\lambda > 2$, we then obtain

$E(\alpha^*) = \alpha, \qquad D^2(\alpha^*) = \dfrac{\alpha^2}{n\lambda - 2},$

and

$\dfrac{\partial \log L}{\partial \alpha} = \dfrac{n\lambda}{\alpha} - \dfrac{n\lambda - 1}{\alpha^*}.$

Thus we have in this case the sign of equality in (32.3.6), which implies that condition A) of theorem (32.3.3) is satisfied. Hence it follows that $\alpha^*$ is a sufficient estimate of $\alpha$, and this may also be directly verified by means of (32.4.1). On the other hand, condition B) is not satisfied, since $\dfrac{\partial \log L}{\partial \alpha}$ is not of the form $k(\alpha^* - \alpha)$. Accordingly the efficiency of $\alpha^*$ is

$e(\alpha^*) = \dfrac{n\lambda - 2}{n\lambda} < 1,$

so that $\alpha^*$ is not efficient for any finite $n$ (cf 32.8). Allowing $n$ to tend to infinity we see, though, that $\alpha^*$ is asymptotically efficient.

33.4. The $\chi^2$ minimum method. — The $\chi^2$ minimum method discussed in 30.3 is only available in the case of a grouped continuous distribution, or a discrete distribution. For large $n$, the estimates obtained by this method are asymptotically equivalent to those given by the simpler modified $\chi^2$ minimum method expressed by the equations (30.3.3) or (30.3.3 a), and we have already remarked in 30.3 that the latter method is, for the cases concerned, identical with the maximum likelihood method.

The main theorem on the limiting distribution of $\chi^2$ when certain parameters are estimated from the sample has been proved in 30.3 under the hypothesis that the method of estimation is the modified $\chi^2$ minimum method.
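Returning to Ex. 2 above: the efficiency $e(\lambda^*) = 1/\left(\lambda\,\frac{d^2\log\Gamma(\lambda)}{d\lambda^2}\right)$ of the moment estimate is easy to evaluate numerically. The sketch below (an illustration, not part of the original text) computes the trigamma function through the standard series $\frac{d^2\log\Gamma(\lambda)}{d\lambda^2} = \sum_{k=0}^{\infty} (\lambda + k)^{-2}$, with an integral estimate for the tail.

```python
def trigamma(lam, terms=100000):
    """d^2 log Gamma(lam) / d lam^2 via the series sum_k 1/(lam+k)^2,
    truncated after `terms` terms plus the integral tail 1/(lam+terms)."""
    head = sum(1.0 / (lam + k) ** 2 for k in range(terms))
    return head + 1.0 / (lam + terms)

def efficiency(lam):
    """Efficiency e(lambda*) of the moment estimate lambda* = sample mean
    for the type III (gamma) shape parameter, as in Ex. 2."""
    return 1.0 / (lam * trigamma(lam))

for lam in (0.1, 0.5, 1.0, 5.0, 50.0):
    print(lam, efficiency(lam))
```

The computed values stay below 1, fall toward zero as $\lambda \to 0$, and approach 1 as $\lambda$ grows, in accordance with the behaviour described in Ex. 2.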
However, we have stated in 30.3 that there is a whole class of methods of estimation leading to the same limiting distribution of $\chi^2$. We shall now prove this statement.

Asymptotic expressions of the estimates obtained by the modified $\chi^2$ minimum method have been given in an explicit form in (30.3.17), for the general case of $s$ unknown parameters $\alpha_1, \ldots, \alpha_s$. Let us suppose that the conditions 1)–3) of the preceding paragraph — or the analogous conditions for a discrete distribution — are satisfied. It then follows from the preceding paragraph that the estimates (30.3.17) are asymptotically normal (this has, in fact, already been shown in 30.3) and asymptotically efficient. Now in all sets of asymptotically normal and asymptotically efficient estimates of the parameters, the terms of order $n^{-\frac{1}{2}}$ must agree, and thus will be the same as in (30.3.17). An inspection of the deduction of the limiting distribution of $\chi^2$ given in 30.3 shows, however, that this limiting distribution is entirely determined by the terms of order $n^{-\frac{1}{2}}$ in (30.3.17). In fact, by (30.3.1) and (30.3.4) we have $\chi^2 = \sum_{i=1}^{r} y_i^2$, and (30.3.18) shows that the limiting distribution of $y = (y_1, \ldots, y_r)$ is determined by the terms in question. It thus follows that the theorem of 30.3 on the limiting distribution of $\chi^2$ holds for any set of asymptotically normal and asymptotically efficient estimates of the parameters.

CHAPTER 34.

Confidence Regions.

34.1. Introductory remarks. — Suppose that we are using a set of sample values to form estimates of a certain number of unknown parameters in a distribution of known mathematical form. Suppose further that the sampling distributions of our estimates are known, so that the respective means, variances etc. can be calculated.

Are we, in such a situation, entitled to make some kind of probability statements with respect to the unknown true values of the parameters? Will it, e.g., be possible to assign two limits to a certain parameter, and to assert that, with some specified probability, the true value of the parameter will be situated between these limits?

In the older literature of the subject, probability statements of this type were freely deduced by means of the famous theorem of Bayes, one of the typical problems treated in this way being the classical problem of inverse probability (cf 34.2, Ex. 2). However, these applications of Bayes' theorem have often been severely criticized, and there has appeared a growing tendency to avoid this kind of argument, and to reconsider the question from entirely new points of view. The attempts so far made in this direction have grouped themselves along two main lines of development, connected with the theory of fiducial probabilities due to R. A. Fisher (cf e.g. Ref. 14, 100, 102, 105–109) and the theory of confidence intervals due to J. Neyman (cf e.g. Ref. 30, 101, 163, 165–167). We shall here in the main have to restrict ourselves to a brief account of the latter theory.

In the next paragraph, we shall consider the case of a single unknown parameter, comparing the older treatment by means of Bayes' theorem with the modern theory. In 34.3, we then proceed to more general cases, and finally we discuss in 34.4 some examples.

34.2. A single unknown parameter. — Consider a sample of $n$ values $x_1, \ldots, x_n$ from a distribution involving a single unknown parameter $\alpha$. We shall first suppose that the distribution is of the continuous type, and has the fr. f. $f(x; \alpha)$. For simplicity we suppose that $f(x; \alpha)$ is defined for all values of $\alpha$. Let $\alpha^* = \alpha^*(x_1, \ldots, x_n)$ be an estimate of $\alpha$, with the fr. f. $g(\alpha^*; \alpha)$.

Having calculated the value of $\alpha^*$ from an actual sample, we now ask if it is possible to make some reasonable probability statement with respect to the unknown value of $\alpha$ in the distribution from which the sample is drawn.
The question will be considered from two fundamentally different points of view.

1. The classical method. In some cases, it may be legitimate to assume that the actual value of the parameter $\alpha$ in the sampled population has been determined by a random experiment. Cases of this character occur e.g. in the statistics of mass production, when $\alpha$ denotes some unknown characteristic of a large batch of manufactured articles, which it is required to estimate from a small sample. The particular batch under consideration will then have to be regarded as an individual drawn from a population of similar batches, where the values of $\alpha$ are submitted to random fluctuations due to variations in the production process and the quality of raw materials. The drawing of one individual from this population of batches is the random experiment which determines the actual value of $\alpha$. — Similar cases occur e.g. in certain genetical problems.

In such cases, $\alpha$ is itself a random variable, having a certain a priori distribution. Let us assume that this distribution is defined by a known fr. f. $\varpi(\alpha)$. In the joint distribution of $\alpha$ and $\alpha^*$, the function $\varpi(\alpha)$ is then the marginal fr. f. of $\alpha$, while $g(\alpha^*; \alpha)$ is the conditional fr. f. of $\alpha^*$ for a given value of $\alpha$. Conversely, the conditional fr. f. of $\alpha$, for a given value of $\alpha^*$, is by (21.4.10)

$h(\alpha \mid \alpha^*) = \dfrac{\varpi(\alpha)\,g(\alpha^*; \alpha)}{\int_{-\infty}^{\infty} \varpi(\alpha)\,g(\alpha^*; \alpha)\,d\alpha}.$

This relation expresses Bayes' theorem as applied to the present case. The quantity

(34.2.1) $\quad P(k_1 < \alpha < k_2 \mid \alpha^*) = \int_{k_1}^{k_2} h(\alpha \mid \alpha^*)\,d\alpha$

then represents the conditional probability of the event $k_1 < \alpha < k_2$, relative to a given value of $\alpha^*$. This probability is commonly known as the a posteriori probability of the event $k_1 < \alpha < k_2$, as distinct from the a priori probability of the same event, which is equal to $\int_{k_1}^{k_2} \varpi(\alpha)\,d\alpha$.

By 14.3 and 21.4, the a posteriori probability (34.2.1) admits a frequency interpretation which runs as follows. Consider a sequence of a large number of independent trials, where each trial consists in drawing a batch from the population of batches, and then drawing a sample of $n$ values from the batch (we use a terminology adapted to the example considered above, but the argument is evidently general). From the sample, we calculate the estimate $\alpha^*$; we further assume that it is possible to examine all the articles in the total batch, so that the corresponding value of $\alpha$ may be directly determined. The result of each trial will thus be a pair of observed values of the variables $\alpha$ and $\alpha^*$. From the sequence of all trials, we now select the sub-sequence formed by those cases where the observed value of $\alpha^*$ belongs to some small neighbourhood of a value $\alpha_0^*$ given in advance. The frequency ratio of the event $k_1 < \alpha < k_2$ in this sub-sequence will then, within the limits of random fluctuations, be given by the value of the a posteriori probability (34.2.1) for $\alpha^* = \alpha_0^*$.

The above is the direct frequency interpretation of the a posteriori probability. By a slight modification of the argument, we may obtain a result which shows a greater formal resemblance to the theory of confidence intervals as given below. Let $\varepsilon$ be given such that $0 < \varepsilon < 1$. To every given $\alpha^*$ we can then determine the limits $k_1 = k_1(\alpha^*, \varepsilon)$ and $k_2 = k_2(\alpha^*, \varepsilon)$ in (34.2.1) such that the probability $P(k_1 < \alpha < k_2 \mid \alpha^*)$ takes the value $1 - \varepsilon$. (The reader may here consult Fig. 33, p. 511, replacing $c_1$ and $c_2$ by $k_1$ and $k_2$.) Consider now once more the above sequence of all trials, and let us calculate the limits $k_1 = k_1(\alpha^*, \varepsilon)$ and $k_2 = k_2(\alpha^*, \varepsilon)$ from the sample obtained in each trial. The interval $(k_1, k_2)$ will then depend on $\alpha^*$, so that in general the successive trials will yield different intervals. Let us in each trial count the occurrence of the event $k_1 < \alpha < k_2$ as a »success», and the occurrence of the opposite event as a »failure».
The probability of a success is then constantly equal to $1 - \varepsilon$, and accordingly (cf 16.6) the frequency ratio of successes in a long series of trials should, within the limits of random fluctuations, be equal to $1 - \varepsilon$. The practical implications of this result, in a case where the method may be legitimately applied, are similar to those discussed below.

2. The method of confidence intervals. In a case where there are definite reasons to regard $\alpha$ as a random variable, with a known probability distribution, the application of the preceding method is perfectly legitimate, and leads to explicit probability statements about the value of $\alpha$ corresponding to a given sample. However, in the majority of cases occurring in practice, these conditions will not be satisfied. As a rule $\alpha$ is simply an unknown constant, and there is no evidence that the actual value of this constant has been determined by some procedure resembling a random experiment. Often there will even be evidence in the opposite direction, as e.g. in cases where the $\alpha$-values of various populations are subject to systematic variation in time or space. Moreover, even when $\alpha$ may be legitimately regarded as a random variable, we usually lack sufficient information about its a priori distribution.

It would thus be highly desirable to be able to approach the question without making any hypothesis about the random or non-random nature of the parameter $\alpha$. Certain methods designed to meet this desideratum have been developed by the authors quoted in the preceding paragraph, and we now proceed to show how the problem may be treated by the method of confidence intervals due to Neyman (l. c., cf also Wilks, Ref. 42, 234). In the present paragraph, we shall consider the question under certain simplifying assumptions, while more general cases will be dealt with in the next paragraph.
We shall now consider $\alpha$ as a variable in the ordinary analytic sense, which assumes a constant, though unknown value in the population from which an actual sample has been drawn. The results thus obtained will hold true whether the value of $\alpha$ has been determined by a random experiment or not, so that this method is actually of more general applicability than the preceding one.

As before, we consider a sample of $n$ values from a distribution with the fr. f. $f(x; \alpha)$, and we denote by $g(\alpha^*; \alpha)$ the fr. f. of the estimate $\alpha^* = \alpha^*(x_1, \ldots, x_n)$. Denote further by $P(S; \alpha)$ the joint pr. f. of the sample variables $x_1, \ldots, x_n$, and let $\varepsilon$ be given such that $0 < \varepsilon < 1$.

For every fixed $\alpha$, the fr. f. $g(\alpha^*; \alpha)$ defines the probability distribution of $\alpha^*$, which may be interpreted as a distribution of a unit of mass on the vertical through the point $(\alpha, 0)$ in the $(\alpha, \alpha^*)$-plane (cf Fig. 33). Suppose now that, for every value of $\alpha$, two quantities $\gamma_1 = \gamma_1(\alpha, \varepsilon)$ and $\gamma_2 = \gamma_2(\alpha, \varepsilon)$ have been determined such that the quantity of mass belonging to the interval $\gamma_1 < \alpha^* < \gamma_2$ of the corresponding vertical — i.e. the probability of the event $\gamma_1 < \alpha^* < \gamma_2$ for the value $\alpha$ of the parameter — becomes

(34.2.2) $\quad P(\gamma_1 < \alpha^* < \gamma_2) = \int_{\gamma_1}^{\gamma_2} g(\alpha^*; \alpha)\,d\alpha^* = 1 - \varepsilon.$

Obviously this can always be done, and there are even an infinity of possible ways of choosing $\gamma_1$ and $\gamma_2$, since these quantities may be determined from the relations $\int_{-\infty}^{\gamma_1} g\,d\alpha^* = \varepsilon_1$ and $\int_{\gamma_2}^{\infty} g\,d\alpha^* = \varepsilon_2$, where $\varepsilon_1$ and $\varepsilon_2$ are any positive numbers such that $\varepsilon_1 + \varepsilon_2 = \varepsilon$.

Fig. 33. Confidence intervals for a single unknown parameter.

If we draw a sample of $n$ values from a distribution corresponding to any value of $\alpha$, the event $\gamma_1 < \alpha^* < \gamma_2$ will thus always have a probability equal to $1 - \varepsilon$. The quantities $\gamma_1$ and $\gamma_2$ depend on $\alpha$, and when $\alpha$ varies, the points $(\alpha, \gamma_1)$ and $(\alpha, \gamma_2)$ will describe two curves in the plane of $(\alpha, \alpha^*)$, as indicated in Fig. 33. We shall assume that each curve is cut in one single point by a parallel to the axis of $\alpha$.
Let the abscissae of the two points where the curves are cut by the horizontal through the point $(0, \alpha^*)$ be $c_1 = c_1(\alpha^*, \varepsilon)$ and $c_2 = c_2(\alpha^*, \varepsilon)$, and let $D(\varepsilon)$ denote the domain situated between the curves. — Consider the three relations

(34.2.3) $\quad (\alpha, \alpha^*) \subset D(\varepsilon), \qquad \gamma_1(\alpha, \varepsilon) < \alpha^* < \gamma_2(\alpha, \varepsilon), \qquad c_1(\alpha^*, \varepsilon) < \alpha < c_2(\alpha^*, \varepsilon).$

For any fixed value of $\alpha$, each of these relations is satisfied by a certain set of points $x = (x_1, \ldots, x_n)$ in the sample space. However, the three relations are perfectly equivalent, since all three express the fact that the point $(\alpha, \alpha^*)$ belongs to the domain $D(\varepsilon)$. Thus the three sets in the sample space are identical, and consequently we obtain from (34.2.2) for every value of $\alpha$

(34.2.4) $\quad P(c_1 < \alpha < c_2; \alpha) = 1 - \varepsilon.$

Both relations (34.2.2) and (34.2.4) give the value of the set function $P(S; \alpha)$ for a certain set $S$ in the sample space, which is defined in two different but equivalent ways, viz. by the two last relations (34.2.3). The first of these asserts that the random variable $\alpha^*$ takes a value between the constant limits $\gamma_1$ and $\gamma_2$. The last relation (34.2.3), on the other hand, asserts that the random variable $c_1(\alpha^*, \varepsilon)$ takes a value smaller than $\alpha$, while the random variable $c_2(\alpha^*, \varepsilon)$ takes a value greater than $\alpha$ or, in other words, that the variable interval $(c_1, c_2)$ covers the fixed point $\alpha$. According to (34.2.4), the probability of this event is equal to $1 - \varepsilon$, whatever the value of $\alpha$.

Consider now a sequence of independent trials, where each trial consists in drawing a sample of $n$ values from a population with the fr. f. $f(x; \alpha)$, the values of $\alpha$ corresponding to the successive trials being at liberty kept constant or allowed to vary in a perfectly arbitrary way, random or non-random. From each set of sample values, we calculate the quantities $c_1 = c_1(\alpha^*, \varepsilon)$ and $c_2 = c_2(\alpha^*, \varepsilon)$, using the value of $\varepsilon$ given in advance. In general, $c_1$ and $c_2$ will have different values in different trials.
Each trial will be counted as a »success», if the corresponding interval $(c_1, c_2)$ covers the corresponding point $\alpha$, and otherwise as a »failure». By (34.2.4), the probability of a success is then constantly equal to $1 - \varepsilon$, and accordingly (cf 16.6) the frequency ratio of successes in a long sequence of trials will, within the limits of random fluctuations, be equal to $1 - \varepsilon$.

Suppose now that we apply constantly the following rule of behaviour. We first choose once for all some small value of $\varepsilon$, say $\varepsilon = p/100$. Whenever a sample has been drawn, and the corresponding limits $c_1$ and $c_2$ have been calculated, we further state that the unknown value of $\alpha$ in the corresponding population is situated between $c_1$ and $c_2$. — According to the above, we shall then always have the probability $\varepsilon = p/100$ of giving a wrong statement. In the long run, our statements will thus be wrong in about $p$ % of all cases, and otherwise correct.

The interval $(c_1, c_2)$ will be called a confidence interval for the parameter $\alpha$, corresponding to the confidence coefficient $1 - \varepsilon$, or the confidence level $\varepsilon = p/100$. The quantities $c_1$ and $c_2$ are the corresponding confidence limits.

Comparing this mode of treatment with the one based on Bayes' theorem, it will be seen that the method of confidence intervals is entirely free from any hypothesis with respect to the random or non-random nature of $\alpha$. On the other hand, it follows from this very generality that the method does not lead to probability statements of the type: »The probability that $\alpha$ is situated between such and such fixed limits is equal to $1 - \varepsilon$». In fact, such a statement has no sense except when $\alpha$ is a random variable.
The statements provided by the method of confidence intervals are of the type of the relation (34.2.4), which expressed in words becomes: »The probability that such and such limits (which may vary from sample to sample) include between them the parameter value $\alpha$ corresponding to the actual sample, is equal to $1 - \varepsilon$». As shown above, we may deduce from this statement a rule of behaviour associated with a constant risk of error, where $\varepsilon$ may be arbitrarily fixed.

It must be observed that the system of confidence intervals corresponding to a given $\varepsilon$ is not unique. Just as we may consider various different estimates of the same parameter $\alpha$, we may also have various systems of confidence intervals, leading to different rules of behaviour.
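The frequency interpretation of (34.2.4) — that the variable interval $(c_1, c_2)$ covers the fixed true value $\alpha$ in about $1 - \varepsilon$ of all trials — is easy to illustrate by simulation. The following sketch (an illustration, not from the original text) takes the standard case of a normal mean with known s. d. $\sigma$, where $c_{1,2} = \bar{x} \mp 1.96\,\sigma/\sqrt{n}$ for $\varepsilon = 0.05$; all numerical values are invented for the purpose of the demonstration.

```python
import random

def confidence_interval(xs, sigma, z=1.96):
    """95 % confidence limits c1, c2 for the mean of a normal
    population with known s.d. sigma (eps = 0.05 in (34.2.4))."""
    n = len(xs)
    xbar = sum(xs) / n
    half = z * sigma / n ** 0.5
    return xbar - half, xbar + half

random.seed(7)
alpha, sigma, n, trials = 10.0, 2.0, 25, 4000
successes = 0
for _ in range(trials):
    xs = [random.gauss(alpha, sigma) for _ in range(n)]
    c1, c2 = confidence_interval(xs, sigma)
    if c1 < alpha < c2:   # the interval covers the true value: a "success"
        successes += 1
coverage = successes / trials
print(coverage)  # close to 1 - eps = 0.95
```

The observed frequency ratio of successes stays, within random fluctuations, near the confidence coefficient $0.95$, whatever fixed value of $\alpha$ is chosen.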